I'm trying to scrape XML data into a data frame from this website:
However, my code keeps throwing errors, such as:
Error in open.connection(x, "rb") : Timeout was reached: [www.dmo.gov.uk] Connection timed out after 10001 milliseconds
The code is below:
library(data.table)
library(rvest)
library(xml2)
url <- read_html("https://www.dmo.gov.uk/data/XmlDataReport?reportCode=D1A")
dt <- rbindlist(lapply(
  url %>%
    html_nodes(css = "body > View_GILTS_IN_ISSUE > View_GILTS_IN_ISSUE") %>%
    xml_attrs(),
  function(x) as.data.table(t(x))
))
dt <- cbind(dt[, 9, with = TRUE],
            as.data.table(lapply(dt[, -9, with = TRUE], as.character)))
dt
Does anyone have any advice on how I can take this to completion?
CodePudding user response:
When I first tried this, I fell at the first hurdle: actually downloading the file. I consistently got the error message: fatal SSL/TLS alert is received (e.g. handshake failed)
I eventually found a solution here
CodePudding user response:
I was able to fix the issue with a combination of mkpt_uk's answer and the one available here: Package "rvest" for web scraping https site with proxy
So, download the file using:
download.file(url, destfile = destination)
followed by:
content <- read_xml(destination)  # read the local copy saved by download.file()
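For anyone wanting the last step as well, here is a rough sketch of how the parsed document might be turned into a data.table, one row per element. It assumes, based on the selector in the question, that the rows are nested View_GILTS_IN_ISSUE elements carrying their values as attributes. The helper name attrs_to_dt and the sample attribute names are made up for illustration, and if the real report declares a default namespace you would probably need xml_ns_strip() before the XPath matches.

```r
library(xml2)
library(data.table)

# Hypothetical helper: one row per <View_GILTS_IN_ISSUE> element that has
# attributes (the attribute-less outer wrapper of the same name is skipped).
attrs_to_dt <- function(doc) {
  rows <- xml_find_all(doc, "//View_GILTS_IN_ISSUE[@*]")
  # xml_attrs() gives one named character vector per node;
  # fill = TRUE tolerates nodes that lack some attributes.
  rbindlist(lapply(xml_attrs(rows), as.list), fill = TRUE)
}

# Made-up sample mimicking the assumed layout (not real DMO data).
sample_xml <- '<Data>
  <View_GILTS_IN_ISSUE>
    <View_GILTS_IN_ISSUE name="Gilt A" isin="X1"/>
    <View_GILTS_IN_ISSUE name="Gilt B" isin="X2"/>
  </View_GILTS_IN_ISSUE>
</Data>'
sample_dt <- attrs_to_dt(read_xml(sample_xml))
print(sample_dt)

# With the real report, after the download step above:
# destination <- tempfile(fileext = ".xml")
# download.file("https://www.dmo.gov.uk/data/XmlDataReport?reportCode=D1A",
#               destfile = destination, mode = "wb")
# dt <- attrs_to_dt(read_xml(destination))
```

All columns come back as character, so numeric or date fields would still need converting afterwards.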