I am trying to scrape multiple webpages using a list of URLs (a CSV file). This is my dataset: https://www.mediafire.com/file/9qh516tdcto7is7/nyp_data.csv/file The "url" column contains all the links that I am trying to scrape.
I tried to use a for() loop:
news_urls <- read_csv("nyp_data.csv")
library(rvest)
content_list <- vector()
for (i in 1:nrow(news_urls)) {
nyp_url <- news_urls[i, 'url']
nyp_html <- read_html(nyp_url)
nyp_nodes <- nyp_html %>%
html_elements(".single__content")
tag_name = ".single__content"
nyp_texts <- nyp_html %>%
html_elements(tag_name) %>%
html_text()
content_list[i] <- nyp_texts[1]
}
However, I am getting an error that says: Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"
I believe the links work; they aren't broken, and I can open each one by clicking it individually.
If a for loop isn't the right tool here, do you have any other ideas for scraping the content?
*I also tried:
urls <- news_urls[,5] #identify the column with the urls
url_xml <- try(apply(urls, 1, read_html)) #apply the function read_html() to the `url` vector
textScraper <- function(x) {
html_text(html_nodes (x, ".single__content")) %>% #in this data, my text is in a node called ".single__content"
str_replace_all("\n", "") %>%
str_replace_all("\t", "") %>%
paste(collapse = '')
}
article_text <- lapply(url_xml, textScraper)
article_text[1]
but it kept giving me the error: Error in open.connection(x, "rb") : HTTP error 404.
CodePudding user response:
The error occurs in this line:
nyp_html <- read_html(nyp_url)
As the error message tells you, the argument passed to read_xml (which read_html calls internally) is a data.frame (among other classes, since it is actually a tibble).
This is because in this line:
nyp_url <- news_urls[i, 'url']
you are using single brackets to subset your data. On a tibble, single brackets return another tibble (a data.frame) containing the selected cell, not the character string inside it. You can avoid this by using double brackets, which extract the value itself, like this:
nyp_url <- news_urls[[i, 'url']]
or this (which I usually find more readable):
nyp_url <- news_urls[i, ]$url
Either should fix your problem. If you want to read more about using these notations you could look at this answer.
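To see the difference without touching the network, here is a minimal sketch. The two-row tibble and its example.com URLs are made up for illustration and only stand in for what read_csv("nyp_data.csv") returns; it assumes the tibble package is installed (it comes with the tidyverse).

```r
library(tibble)

# Hypothetical stand-in for the tibble returned by read_csv("nyp_data.csv")
news_urls <- tibble(
  id  = 1:2,
  url = c("https://example.com/a", "https://example.com/b")
)

single <- news_urls[1, "url"]    # single brackets: still a 1 x 1 tibble
double <- news_urls[[1, "url"]]  # double brackets: a length-one character vector
dollar <- news_urls[1, ]$url     # row subset, then $: same value as `double`

class(single)              # includes "tbl_df" and "data.frame"
class(double)              # "character"
identical(double, dollar)  # both forms extract the same string
```

Only the character versions (double and dollar) are something read_html() can treat as a URL, which is why either replacement line works inside the loop.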