I have tried almost everything I know to scrape Supplementary Table 8 (page 26) from the following PDF: https://static-content.springer.com/esm/art:10.1038/nplants.2016.167/MediaObjects/41477_2016_BFnplants2016167_MOESM277_ESM.pdf. I have not managed to do it with datapasta, rvest, or read.table(text="copy paste"). I would like your input on how to scrape complex tables like this one from online sources. Any help or suggestions are appreciated.
If you cannot open the link, please write a comment and I'll find an alternative.
CodePudding user response:
Here is a possible solution using pdftools. Note that as_tibble is not necessary; I only use it for pretty printing.
COMMENT: Looking at the PDF, Supplementary Table 8 is on page 27, not 26, so I don't know whether you want Table 7 from page 26 or Table 8 from page 27 (the code extracts the latter). Either way, edit the code to suit your needs. The same applies to the matches regexp: it is written to match the data lines of that particular table, so it will need adjusting for a table with a different layout.
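library(pdftools)
#> Using poppler version 22.04.0
url <- "https://static-content.springer.com/esm/art:10.1038/nplants.2016.167/MediaObjects/41477_2016_BFnplants2016167_MOESM277_ESM.pdf"
# Read every page as text and split it into individual lines
# (wrap the call in suppressWarnings() if you want)
lines <- pdf_text(url) |> strsplit("\n") |> unlist()
# Keep the lines between the captions of Supplementary Tables 8 and 9
from <- grep("^Supplementary Table 8", lines)
to <- grep("^Supplementary Table 9", lines)
block <- lines[seq(from, to)]
# Header row: split on runs of two or more spaces so "gene id" stays in one piece
headers <- block[grep(" gene id", block)[1]] |>
  trimws() |> strsplit(" {2,}") |> unlist()
# One capture group per column: gene id, three numeric (or Inf) values, name, description
matches <- regexec(
  "(\\S{9}) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\S+) +(.*)",
  block)
# Drop the full match (first element), keep the six groups, and stack them as rows
table <- as.data.frame(do.call(rbind, regmatches(block, matches) |> lapply("[", -1)))
colnames(table) <- headers
print(tibble::as_tibble(table))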
#> # A tibble: 43 × 6
#>    `gene id` A.thaliana C.hirsuta foldChange Name      Description
#>    <chr>     <chr>      <chr>     <chr>      <chr>     <chr>
#>  1 AT2G23340 336.63     770.89    2.29       DEAR3     ethylene-responsive tran…
#>  2 AT4G16610 41.56      151.80    3.65       AT4G16610 C2H2-like zinc finger pr…
#>  3 AT1G26260 62.80      162.01    2.58       CIB5      transcription factor bHL…
#>  4 AT5G04840 143.62     461.98    3.22       AT5G04840 bZIP protein
#>  5 AT5G66350 42.91      131.64    3.07       SHI       Lateral root primordium-…
#>  6 AT5G65510 40.84      1166.91   28.57      AIL7      AINTEGUMENTA-like 7 prot…
#>  7 AT5G57390 191.57     756.34    3.95       AIL5      AP2-like ethylene-respon…
#>  8 AT5G56270 235.78     515.99    2.19       ATWRKY2   putative WRKY transcript…
#>  9 AT5G51990 0.00       5.45      Inf        CBF4      dehydration-responsive e…
#> 10 AT5G46880 651.22     1337.28   2.05       HB-7      homeobox-leucine zipper …
#> # … with 33 more rows
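As the `<chr>` column types show, every field comes back as text. The sketch below replays the same regexec/regmatches approach on a few hard-coded lines, so it runs without downloading the PDF, and then converts the three numeric columns. The sample lines reuse two rows from the output above (one description is truncated there and is kept as-is); the spacing between columns and the names sample_lines, pattern, tbl and num_cols are assumptions made for illustration, not something taken from the PDF.

```r
# Lines imitating pdf_text() output; the column spacing is assumed
sample_lines <- c(
  "Supplementary Table 8",
  "  gene id      A.thaliana   C.hirsuta   foldChange   Name        Description",
  "  AT5G04840    143.62       461.98      3.22         AT5G04840   bZIP protein",
  "  AT5G51990    0.00         5.45        Inf          CBF4        dehydration-responsive e…"
)

# Header row: split on runs of two or more spaces so "gene id" stays together
headers <- unlist(strsplit(trimws(sample_lines[2]), " {2,}"))

# One capture group per column; the caption and header lines do not match
pattern <- "(\\S{9}) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\S+) +(.*)"
matches <- regexec(pattern, sample_lines)

# Drop the full match, keep the six groups, and discard non-matching lines
groups <- lapply(regmatches(sample_lines, matches), "[", -1)
groups <- Filter(function(g) length(g) > 0, groups)

tbl <- as.data.frame(do.call(rbind, groups), stringsAsFactors = FALSE)
colnames(tbl) <- headers

# Convert the numeric columns; as.numeric("Inf") gives Inf
num_cols <- c("A.thaliana", "C.hirsuta", "foldChange")
tbl[num_cols] <- lapply(tbl[num_cols], as.numeric)

str(tbl)
```

The same last two statements should work on the table built from the real PDF, provided the column names come out as in the printed output.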