Home > OS >  Programmatically extract/scrape .csv tables embedded in rendered html file using `datatable`
Programmatically extract/scrape .csv tables embedded in rendered html file using `datatable`

Time:06-15

Is it possible to scrape the contents of a datatable, or the embedded .csv file, from a .html file produced using knitr? E.g. using DT::datatable() with the option buttons = c('csv') inside a .Rmd file.

My first approach is to use rvest (e.g. rvest::read_html(x = 'example.html') %>% rvest::html_table()), but it doesn't find any tables, I assume that's because datatable doesn't use the html table element to display data.

Below is a minimal example of an .Rmd file that I knit a '.html' file from that includes two datatables I'd like to scrape.

example.Rmd

    ---
    title: "Example for scraping"
    output: html_document
    ---
    
    # Example 1
    
    ```{r}
    tbl1 <- mtcars[1:20, 1:4]
    DT::datatable(tbl1, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
    ```
    
    # Example 2
   
    ```{r}
    tbl2 <- mtcars[21:32, 7:11]
    DT::datatable(tbl2, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
    ```

This can be knitted within RStudio, or using rmarkdown::render(input = 'example.Rmd'), resulting in the example.html I'd like to scrape from.

Thank you!

CodePudding user response:

The htmlwidgets, that renders into htmltables, stores the data as JSON that you could access and use to rebuilt dataframes. In this way you would also get around all kinds of settings for the pagination. I've made an illustrative example of this approach below:

library(tidyverse)
library(rvest)
library(jsonlite)

# Get the JSON that is used to create the table
json_data <-
  read_html("example.html") |> 
  html_elements(css = "#example-1") |>
  html_elements("script") |>
  html_text() |>
  jsonlite::fromJSON()

# Extract data and reshape it
table <-
  json_data$x[["data"]] |>
  as_tibble() |>
  t() |> 
  as_tibble()

# Extract variable names from html stored in the json
variable_names <-
  minimal_html(json_data$x[["container"]]) |>
  html_table() %>%
  .[[1]] |>
  colnames()

# Add the correct column names
colnames(table) <- variable_names

table

Output:

# A tibble: 20 × 5
   ``                  mpg   cyl   disp  hp   
   <chr>               <chr> <chr> <chr> <chr>
 1 Mazda RX4           21    6     160   110  
 2 Mazda RX4 Wag       21    6     160   110  
 3 Datsun 710          22.8  4     108   93   
 4 Hornet 4 Drive      21.4  6     258   110  
 5 Hornet Sportabout   18.7  8     360   175  
 6 Valiant             18.1  6     225   105  
 7 Duster 360          14.3  8     360   245  
 8 Merc 240D           24.4  4     146.7 62   
 9 Merc 230            22.8  4     140.8 95   
10 Merc 280            19.2  6     167.6 123  
11 Merc 280C           17.8  6     167.6 123  
12 Merc 450SE          16.4  8     275.8 180  
13 Merc 450SL          17.3  8     275.8 180  
14 Merc 450SLC         15.2  8     275.8 180  
15 Cadillac Fleetwood  10.4  8     472   205  
16 Lincoln Continental 10.4  8     460   215  
17 Chrysler Imperial   14.7  8     440   230  
18 Fiat 128            32.4  4     78.7  66   
19 Honda Civic         30.4  4     75.7  52   
20 Toyota Corolla      33.9  4     71.1  65   

Update. A way to get all the tables using the same approach, i.e. not the most beautiful code:

# Get the html of the the scripts
html_data <-
  read_html("test2.html") |> 
  html_elements(css = "script[type='application/json']") |>
  html_text() 

get_table <- function(json) {

  # Read the JSON
  json_data <- jsonlite::fromJSON(json)
  
  # Extract data and reshape it
  table <-
    json_data$x[["data"]] |>
    as_tibble() |>
    t() |> 
    as_tibble()
  
  # Extract variable names from html
  variable_names <-
    minimal_html(json_data$x[["container"]]) |>
    html_table() %>%
    .[[1]] |>
    colnames()
  
  # Add column names
  colnames(table) <- variable_names
  
  table
  
}

tables <- lapply(html_data, get_table)

tables
  • Related