I am currently working on a project to scrape the content of the Performance Characteristics table on this website:
https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund
The value I want to extract from this table is the 12m trailing yield of 3.43%.
The code I wrote to do this is:
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="fundamentalsAndRisk"]/div') %>%
html_table()
etf_Data <- etf_Data[[1]]
This returned an empty list, so the final line failed with the error message 'Error in etf_Data[[1]] : subscript out of bounds'.
Using Google Chrome's Inspect tool, I have tried various XPaths, including reading the value with html_text:
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="fundamentalsAndRisk"]/div/div[4]/span[2]') %>%
html_text()
etf_Data <- etf_Data[[1]]
However, this was also unsuccessful.
Having gone through other Stack Overflow answers, I have not been able to solve my issue.
Would someone be able to assist?
Thank you, C
CodePudding user response:
A couple of things:
- The content you want is served from a different URI, which you normally reach only after manually accepting certain conditions on the page.
- The data you want is not within a table.
You can append the query string parameter siteEntryPassthrough=true to get to the right URI, and then use :contains with an adjacent sibling combinator (+) to get the desired value:
library(rvest)
library(magrittr)
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund?switchLocale=y&siteEntryPassthrough=true"
trailing_12m_yield <- url %>%
read_html() %>%
html_element('.caption:contains("12m Trailing Yield") + .data') %>%
html_text2()
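To see how the :contains and adjacent sibling selector behaves without hitting the live site, here is a minimal offline sketch. The HTML snippet is hypothetical: it only assumes that the label sits in an element with class caption and the value in the immediately following sibling with class data, which may not match the real page exactly. The "Another Metric" row and its value are placeholders.

```r
library(rvest)

# Hypothetical markup (an assumption, not copied from the real page):
# a .caption label immediately followed by a .data sibling holding the value.
snippet <- '
<div id="fundamentalsAndRisk">
  <div>
    <div><span class="caption">Another Metric</span> <span class="data">0.00%</span></div>
    <div><span class="caption">12m Trailing Yield</span> <span class="data">3.43%</span></div>
  </div>
</div>'

page <- read_html(snippet)

# :contains() picks the caption by its text; "+" then selects the
# element directly after it, provided that element has class "data".
yield_text <- page %>%
  html_element('.caption:contains("12m Trailing Yield") + .data') %>%
  html_text2()

# Strip the percent sign to get a numeric value.
yield_value <- as.numeric(sub("%", "", yield_text, fixed = TRUE))

print(yield_text)
print(yield_value)
```

If html_element() returns a missing node on the real page, the markup or the URI is probably not what this sketch assumes, so check the HTML that read_html() actually receives before adjusting the selector.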