Is it possible to scrape the contents of a datatable
, or the embedded .csv
file, from a .html
file produced using knitr
? E.g. using DT::datatable()
with the option buttons = c('csv')
inside a .Rmd
file.
My first approach is to use rvest
(e.g. rvest::read_html(x = 'example.html') %>% rvest::html_table()
), but it doesn't find any tables, I assume that's because datatable
doesn't use the html table element to display data.
Below is a minimal example of an .Rmd
file that I knit a '.html' file from that includes two datatables I'd like to scrape.
example.Rmd
---
title: "Example for scraping"
output: html_document
---
# Example 1
```{r}
tbl1 <- mtcars[1:20, 1:4]
DT::datatable(tbl1, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
```
# Example 2
```{r}
tbl2 <- mtcars[21:32, 7:11]
DT::datatable(tbl2, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
```
This can be knitted within RStudio, or using rmarkdown::render(input = 'example.Rmd')
, resulting in the example.html
I'd like to scrape from.
Thank you!
CodePudding user response:
The htmlwidgets
, that renders into htmltables
, stores the data as JSON that you could access and use to rebuilt dataframes. In this way you would also get around all kinds of settings for the pagination. I've made an illustrative example of this approach below:
library(tidyverse)
library(rvest)
library(jsonlite)
# Get the JSON that is used to create the table
json_data <-
read_html("example.html") |>
html_elements(css = "#example-1") |>
html_elements("script") |>
html_text() |>
jsonlite::fromJSON()
# Extract data and reshape it
table <-
json_data$x[["data"]] |>
as_tibble() |>
t() |>
as_tibble()
# Extract variable names from html stored in the json
variable_names <-
minimal_html(json_data$x[["container"]]) |>
html_table() %>%
.[[1]] |>
colnames()
# Add the correct column names
colnames(table) <- variable_names
table
Output:
# A tibble: 20 × 5
`` mpg cyl disp hp
<chr> <chr> <chr> <chr> <chr>
1 Mazda RX4 21 6 160 110
2 Mazda RX4 Wag 21 6 160 110
3 Datsun 710 22.8 4 108 93
4 Hornet 4 Drive 21.4 6 258 110
5 Hornet Sportabout 18.7 8 360 175
6 Valiant 18.1 6 225 105
7 Duster 360 14.3 8 360 245
8 Merc 240D 24.4 4 146.7 62
9 Merc 230 22.8 4 140.8 95
10 Merc 280 19.2 6 167.6 123
11 Merc 280C 17.8 6 167.6 123
12 Merc 450SE 16.4 8 275.8 180
13 Merc 450SL 17.3 8 275.8 180
14 Merc 450SLC 15.2 8 275.8 180
15 Cadillac Fleetwood 10.4 8 472 205
16 Lincoln Continental 10.4 8 460 215
17 Chrysler Imperial 14.7 8 440 230
18 Fiat 128 32.4 4 78.7 66
19 Honda Civic 30.4 4 75.7 52
20 Toyota Corolla 33.9 4 71.1 65
Update. A way to get all the tables using the same approach, i.e. not the most beautiful code:
# Get the html of the the scripts
html_data <-
read_html("test2.html") |>
html_elements(css = "script[type='application/json']") |>
html_text()
get_table <- function(json) {
# Read the JSON
json_data <- jsonlite::fromJSON(json)
# Extract data and reshape it
table <-
json_data$x[["data"]] |>
as_tibble() |>
t() |>
as_tibble()
# Extract variable names from html
variable_names <-
minimal_html(json_data$x[["container"]]) |>
html_table() %>%
.[[1]] |>
colnames()
# Add column names
colnames(table) <- variable_names
table
}
tables <- lapply(html_data, get_table)
tables