Empty variable when attempting to scrape web data using rvest

Time:06-15

I am trying to use rvest to scrape a data point from:

https://www.vanguardinvestor.co.uk/investments/vanguard-ftse-developed-europe-ex-uk-ucits-etf-eur-distributing/distributions

What I am attempting to capture is the "Yield As at close 30 Apr 2022" number, which is 2.53%.

I have attempted this using the following code:

library(rvest)

url <- "https://www.vanguardinvestor.co.uk/investments/vanguard-ftse-developed-europe-ex-uk-ucits-etf-eur-distributing/distributions"

# Download and parse the page
url_read <- url %>%
  read_html()

# Select the node by its absolute XPath and extract its text
etf_Data <- url_read %>%
  html_nodes(xpath = '/html/body/ukd-app/ukd-pla-nav/div[1]/ukd-fund-detail/div[2]/ukd-distributions/dl/div[2]') %>%
  html_text()

However, it is returning character(0).

Based on previous answers on SO, I tried to work out whether a passthrough query is required in the URL, but my knowledge is fairly limited and I have been unable to tell whether it is.

I have also tried

etf_Data <- url_read %>%
    html_element('.caption:contains("Yield As at close 30 Apr 2022")   .data') %>%
    html_text2()

and

etf_Data <- url_read %>%
    html_nodes(xpath='/html/body/ukd-app/ukd-pla-nav/div[1]/ukd-fund-detail/div[2]/ukd-distributions/dl/div[2]') %>%
    html_table()

with the same response.

Any help you could provide would be appreciated.

Thanks C

CodePudding user response:

The problem is that the data is loaded into the page dynamically using JavaScript, so the HTML that read_html() downloads does not contain it. You could work around this using RSelenium.
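To illustrate why the XPath comes back empty, here is a minimal sketch that runs offline. The HTML string is a hypothetical stand-in for what the server might send before any JavaScript runs (I have not copied it from the real page), and the XPath is a shortened version of the one in the question:

```r
library(rvest)

# Hypothetical stand-in for the server response: the custom element is an
# empty shell, and its contents would only be created later by JavaScript.
static_html <- "<html><body><ukd-app></ukd-app><script src='main.js'></script></body></html>"

page <- read_html(static_html)

# Same style of absolute XPath as in the question, shortened for the sketch.
result <- page %>%
  html_elements(xpath = "/html/body/ukd-app/ukd-pla-nav/div[1]") %>%
  html_text()

print(result)  # character(0)
```

Since rvest only parses the HTML it is given and does not execute scripts, nodes that exist only after rendering in a browser cannot be matched, whichever selector is used.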

A much simpler solution is to request the data directly from the API, which only needs a slight modification of the URL:

library(httr)

# Request the fund details from the API and parse the JSON response
resp <- content(GET("https://www.vanguardinvestor.co.uk/api/fund-detail/vanguard-ftse-developed-europe-ex-uk-ucits-etf-eur-distributing"))
yield <- resp$fundData$distributionHistory$yield[[1]]
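If you want something a little more defensive, the same request could be wrapped roughly like this. This is only a sketch: the function names extract_yield and fetch_yield are made up for illustration, and the field path is simply the one used above, so it may need adjusting if the API response is shaped differently or changes.

```r
# Hypothetical helper: pull the first yield value out of the parsed response,
# returning NA instead of failing when the expected fields are missing.
extract_yield <- function(parsed) {
  yields <- parsed$fundData$distributionHistory$yield
  if (length(yields) == 0) {
    return(NA_character_)
  }
  yields[[1]]
}

# Hypothetical helper: request the fund details for a given fund slug
# (the last part of the URL) and extract the yield.
fetch_yield <- function(slug) {
  url <- paste0("https://www.vanguardinvestor.co.uk/api/fund-detail/", slug)
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
  parsed <- httr::content(resp, as = "parsed", type = "application/json")
  extract_yield(parsed)
}
```

Calling fetch_yield("vanguard-ftse-developed-europe-ex-uk-ucits-etf-eur-distributing") should then give the same value as the code above, provided the endpoint still responds in the same way.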