Webscrape complex tables using R

Time:01-31

I have tried almost everything I know to scrape Supplementary Table 8 (from page 26) of the following PDF: https://static-content.springer.com/esm/art:10.1038/nplants.2016.167/MediaObjects/41477_2016_BFnplants2016167_MOESM277_ESM.pdf. I have not managed to do it with datapasta, rvest, or read.table(text="copy paste"), and I would like your input on how to scrape complex tables like this one from online sources. Any help or suggestions are appreciated.

If you cannot open the link, please write a comment and I'll find an alternative.

CodePudding user response:

Here is a possible solution using pdftools. Note that as_tibble is not necessary; I only use it for pretty printing.

COMMENT: Looking at the PDF, Supplementary Table 8 is on page 27, not 26, so I don't know whether you want Table 7 from page 26 or Table 8 from page 27 (the code extracts the latter). Either way, adjust the code as needed. The same goes for the regular expression stored in matches: it is written to match the rows of that particular table.

library(pdftools)
#> Using poppler version 22.04.0

url <- "https://static-content.springer.com/esm/art:10.1038/nplants.2016.167/MediaObjects/41477_2016_BFnplants2016167_MOESM277_ESM.pdf"

lines <- pdf_text(url) |> strsplit("\n") |> unlist()   # wrap in suppressWarnings() if you want

from <- grep("^Supplementary Table 8" , lines)
to   <- grep("^Supplementary Table 9" , lines)

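# the header row is the first line containing "gene id"; columns are separated by two or more spaces
headers <- lines[seq(from, to)][grep(" +gene id", lines[seq(from, to)])[1]] |>
  trimws() |> strsplit("  +") |> unlist()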

# one capture group per column: gene id, three numeric (or Inf) columns, name, description
matches <- regexec(
  "(\\S{9}) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\S+) +(.*)",
  lines[seq(from, to)] )

table <- as.data.frame(do.call(rbind, regmatches(lines[seq(from, to)], matches) |> sapply("[", -1)))
colnames(table) <- headers

print(tibble::as_tibble(table))
#> # A tibble: 43 × 6
#>    `gene id` A.thaliana C.hirsuta foldChange Name      Description              
#>    <chr>     <chr>      <chr>     <chr>      <chr>     <chr>                    
#>  1 AT2G23340 336.63     770.89    2.29       DEAR3     ethylene-responsive tran…
#>  2 AT4G16610 41.56      151.80    3.65       AT4G16610 C2H2-like zinc finger pr…
#>  3 AT1G26260 62.80      162.01    2.58       CIB5      transcription factor bHL…
#>  4 AT5G04840 143.62     461.98    3.22       AT5G04840 bZIP protein             
#>  5 AT5G66350 42.91      131.64    3.07       SHI       Lateral root primordium-…
#>  6 AT5G65510 40.84      1166.91   28.57      AIL7      AINTEGUMENTA-like 7 prot…
#>  7 AT5G57390 191.57     756.34    3.95       AIL5      AP2-like ethylene-respon…
#>  8 AT5G56270 235.78     515.99    2.19       ATWRKY2   putative WRKY transcript…
#>  9 AT5G51990 0.00       5.45      Inf        CBF4      dehydration-responsive e…
#> 10 AT5G46880 651.22     1337.28   2.05       HB-7      homeobox-leucine zipper …
#> # … with 33 more rows
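
The printed tibble shows every column as <chr>, so the numeric columns still need converting before any analysis. Below is a minimal offline sketch of the same regexec/regmatches approach that also does that conversion. The sample_lines vector is a hypothetical stand-in for the output of pdf_text(): the numbers come from rows 1 and 9 above, while the spacing and the description text are placeholders. The names row_pattern, num_cols and gene_id are also my own choices, not part of the answer above.

```r
# Hypothetical stand-in for the lines returned by pdf_text() |> strsplit("\n");
# the description text and exact spacing are placeholders, not the real PDF content.
sample_lines <- c(
  "Supplementary Table 8. (caption line, not a data row)",
  "  gene id      A.thaliana   C.hirsuta   foldChange   Name    Description",
  "  AT2G23340    336.63       770.89      2.29         DEAR3   placeholder description one",
  "  AT5G51990    0.00         5.45        Inf          CBF4    placeholder description two"
)

# One capture group per column; columns are separated by one or more spaces.
row_pattern <- "(\\S{9}) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\d+\\.\\d+|Inf) +(\\S+) +(.*)"

hits <- regmatches(sample_lines, regexec(row_pattern, sample_lines))
hits <- hits[lengths(hits) > 0]                  # drop lines that did not match

# Remove the full-match element, keep the six captured groups, stack into rows.
rows <- do.call(rbind, lapply(hits, `[`, -1))
df <- as.data.frame(rows, stringsAsFactors = FALSE)
colnames(df) <- c("gene_id", "A.thaliana", "C.hirsuta", "foldChange", "Name", "Description")

# Convert the numeric columns; as.numeric("Inf") yields Inf, so those rows survive.
num_cols <- c("A.thaliana", "C.hirsuta", "foldChange")
df[num_cols] <- lapply(df[num_cols], as.numeric)

str(df)
```

Filtering with lengths(hits) > 0 makes the removal of non-matching lines explicit rather than relying on rbind ignoring empty vectors.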