I'm new to web scraping and am trying to scrape the data from this interactive chart using R so that all the series are displayed in a single table: https://www.e61.in/spendtracker
I've used the developer tools in Chrome (Inspect > Network > Fetch/XHR) but cannot find the data points.
I would appreciate it if someone could take a quick look and let me know a) whether the data points are stored on the page somewhere, b) if possible, how you identified the right file, and c) whether it is reasonably straightforward to then generate a table.
CodePudding user response:
You're trying to scrape the wrong URL: the page you linked embeds the chart in an iframe. Take a close look at the source of the iframe document instead: https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html
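As a rough sketch of how an iframe source can be located from R, the snippet below lists the src attribute of every iframe in a document. It uses a small inline HTML string as a hypothetical stand-in for the outer page (the example.com URL is made up), and it assumes the iframe is present in the static HTML. If the real page injects the iframe with JavaScript, read_html() will not see it and the Elements panel in the browser's developer tools is the more reliable place to look.

```r
library(rvest)

# Hypothetical stand-in for the outer page; the real page's markup may differ
outer_page <- minimal_html('
  <p>Spend tracker</p>
  <iframe src="https://example.com/chart.html" title="chart"></iframe>
')

# Collect the src attribute of every iframe in the document
iframe_src <- outer_page %>%
  html_elements("iframe") %>%
  html_attr("src")

iframe_src
```

For a live page, replace minimal_html(...) with read_html() on the page URL and check whether any iframe elements come back at all.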
CodePudding user response:
Continuing from that iframe URL:
before switching to R and rvest, check the actual page source, perhaps after running it through a beautifier. You'll see a Plotly.newPlot() call; note that it receives the array of data series as its second parameter. One option is to extract that piece of JavaScript with a regex, parse it as JSON, and work from there.
Perhaps something like this:
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(jsonlite)
library(purrr)
library(ggplot2)
url <- "https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html"
html <- read_html(url)
# extract the text of the last <script> tag, which contains the Plotly.newPlot() call and the data series
plotly_js <- html %>%
  html_element("script:last-of-type") %>%
  html_text()
# extract the array from the JS string; \Q and \E quote the literal text so special chars need no escaping
p_dataseries <- str_extract(plotly_js, '\\Q[{"connectgaps"\\E.*?\\Q"type":"scatter"}]\\E')
# parse the extracted string as JSON, keeping it as a nested list
ds_j <- fromJSON(p_dataseries, simplifyVector = FALSE)
# extract data, result will be in long format
df <- map_df(ds_j, `[`, c("name", "x", "y")) %>%
  unnest(c(x, y)) %>%
  mutate(date = as.POSIXct(x))
str(df)
#> tibble [2,346 × 4] (S3: tbl_df/tbl/data.frame)
#> $ name: chr [1:2346] "Total" "Total" "Total" "Total" ...
#> $ x : chr [1:2346] "2020-01-12T00:00:00" "2020-01-19T00:00:00" "2020-01-26T00:00:00" "2020-02-02T00:00:00" ...
#> $ y : num [1:2346] 100 100.1 100.7 99.3 97.8 ...
#> $ date: POSIXct[1:2346], format: "2020-01-12" "2020-01-19" ...
head(df)
#> # A tibble: 6 × 4
#>   name  x                       y date
#>   <chr> <chr>               <dbl> <dttm>
#> 1 Total 2020-01-12T00:00:00 100   2020-01-12 00:00:00
#> 2 Total 2020-01-19T00:00:00 100.  2020-01-19 00:00:00
#> 3 Total 2020-01-26T00:00:00 101.  2020-01-26 00:00:00
#> 4 Total 2020-02-02T00:00:00  99.3 2020-02-02 00:00:00
#> 5 Total 2020-02-09T00:00:00  97.8 2020-02-09 00:00:00
#> 6 Total 2020-02-16T00:00:00 100.  2020-02-16 00:00:00
# build the plot; ggplot layers must be chained with +
p <- df %>%
  ggplot(aes(x = date, y = y, color = name)) +
  geom_path() +
  theme_minimal()
p
Created on 2022-09-27 with reprex v2.0.2
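Since the question asks for all series in a single table, the long-format result could then be reshaped so that each series becomes its own column. The sketch below is only an illustration: it uses a small made-up data frame with the same name, x and y columns as df, and "Retail" is a hypothetical series name, not one taken from the chart.

```r
library(dplyr)
library(tidyr)

# Toy long-format data mimicking the structure of df (values are made up)
df_long <- tibble(
  name = rep(c("Total", "Retail"), each = 2),
  x    = rep(c("2020-01-12T00:00:00", "2020-01-19T00:00:00"), times = 2),
  y    = c(100, 100.1, 100, 98.5)
)

# One row per date, one column per series
df_wide <- df_long %>%
  mutate(date = as.Date(substr(x, 1, 10))) %>%
  select(date, name, y) %>%
  pivot_wider(names_from = name, values_from = y)

df_wide
```

Applying the same mutate/select/pivot_wider steps to the scraped df should give one column per series, assuming each series has at most one value per date.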