Error message when using read_xml to scrape data

Time:04-21

Being relatively new to web scraping in R, I am hoping for some help with an issue in a scraping project. I want to scrape the data that generates the chart on this page.

Price Chart

I have inspected the page in Chrome and identified the link that returns the data.

Website Inspection Screenshot

Using this URL, I have written the following code to parse the data:

library(xml2)

url <- 'https://www.solactive.com/Indices/?indexhistory=DE000SL0BBT0&indexhistorytype=max'
index_data <- read_xml(url)

Unfortunately, I receive this error message:

Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Failed to parse text

I have inspected the request for this URL in the browser, which shows the following:

Response Headers

content-encoding: gzip
content-length: 20624
content-type: text/html; charset=UTF-8
date: Thu, 21 Apr 2022 00:33:05 GMT
server: nginx
strict-transport-security: max-age=63072000
vary: Accept-Encoding

Accept Headers (snapshot)

accept: application/json, text/javascript, */*; q=0.01
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9

I have also tried passing the following encoding, with no success:

index_data <- read_xml(url, encoding = "gzip, deflate, br")

What I am after is a data table with the columns index_id, date, and value.

Any assistance would be appreciated.

Thank you

User response:

I am not sure why, in R, the response remains HTML despite setting various headers, whereas in Python passing only the referer header is enough to get JSON back. It is a bit of a faff, but you can extract the JSON text from a p tag in the parsed response and parse it with jsonlite:

library(httr2)
library(rvest)

# Referer header to send with the request
headers <- c(referer = "https://www.solactive.com/Indices/?index=DE000SL0BBT0")

# Query-string parameters taken from the URL in the question
params <- list(indexhistory = "DE000SL0BBT0", indexhistorytype = "max")

data <- request("https://www.solactive.com/Indices/") |>
  req_headers(!!!headers) |>
  req_url_query(!!!params) |>
  req_perform() |>
  resp_body_html() |>
  html_element("p") |>   # the JSON text ends up inside a <p> tag
  html_text() |>
  jsonlite::parse_json(simplifyVector = TRUE)
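
The question asks for a table with index_id, date, and value. Here is a rough sketch of that last step. It uses a small made-up payload so it runs offline. The field names indexId, timestamp, and value, and the assumption that timestamps are milliseconds since the Unix epoch, are guesses for illustration and should be checked against the real response.

```r
library(jsonlite)

# Hypothetical sample payload: the field names and the millisecond
# timestamps are assumptions, not taken from the real response.
sample_json <- '[
  {"indexId": "DE000SL0BBT0", "timestamp": 1650412800000, "value": 101.5},
  {"indexId": "DE000SL0BBT0", "timestamp": 1650499200000, "value": 102.25}
]'

# fromJSON() simplifies an array of records to a data frame,
# much like parse_json(simplifyVector = TRUE)
parsed <- fromJSON(sample_json)

# Rename the columns and convert the epoch milliseconds to dates (UTC)
index_table <- data.frame(
  index_id = parsed$indexId,
  date     = as.Date(as.POSIXct(parsed$timestamp / 1000,
                                origin = "1970-01-01", tz = "UTC")),
  value    = parsed$value
)

print(index_table)
```

With the real response, the same reshaping would be applied to the parsed result of the request instead of sample_json, adjusting the column names to whatever the JSON actually contains.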