Using for loop to scrape webpages in R

Time:11-23

I am trying to scrape multiple webpages using a list of URLs (a csv file). This is my dataset: https://www.mediafire.com/file/9qh516tdcto7is7/nyp_data.csv/file The "url" column includes all the links that I am trying to scrape.

I tried to use a for() loop:

library(readr)
library(rvest)

news_urls <- read_csv("nyp_data.csv")

content_list <- vector()
for (i in 1:nrow(news_urls)) {
  nyp_url <- news_urls[i, 'url']
  nyp_html <- read_html(nyp_url)
  tag_name <- ".single__content"
  nyp_texts <- nyp_html %>% 
    html_elements(tag_name) %>% 
    html_text()
  content_list[i] <- nyp_texts[1]
}

However, I am getting an error that says: Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

I believe the links work; they aren't broken, and I can access them by clicking an individual link.

If a for loop isn't what I should be using here, do you have any other ideas for scraping the content?

I also tried:

library(stringr)

urls <- news_urls[, 5] #identify the column with the urls
url_xml <- try(apply(urls, 1, read_html)) #apply the function read_html() to each url

textScraper <- function(x) {
  html_text(html_nodes(x, ".single__content")) %>% #in this data, my text is in a node called ".single__content"
    str_replace_all("\n", "") %>%
    str_replace_all("\t", "") %>%
    paste(collapse = '')
}

article_text <- lapply(url_xml, textScraper)
article_text[1]

but it kept giving me an error: Error in open.connection(x, "rb") : HTTP error 404.

CodePudding user response:

The error occurs in this line:

nyp_html <- read_html(nyp_url)

As the error message tells you, the argument to read_xml (which is what read_html calls internally) is a data.frame (more precisely a tibble, which is also a data.frame).

This is because in this line:

nyp_url <- news_urls[i, 'url']

you are using single brackets to subset your data. On a tibble, single brackets always return another tibble containing the filtered data rather than a plain value. You can avoid this by using double brackets like this:

nyp_url <- news_urls[[i, 'url']]

or this (which I usually find more readable):

nyp_url <- news_urls[i, ]$url
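To see the difference, here is a quick sketch with a toy tibble (the column name url matches your data; the example URLs are made up):

```r
library(tibble)

df <- tibble(url = c("https://a.example", "https://b.example"))

class(df[1, "url"])    # "tbl_df" "tbl" "data.frame" - still a tibble, not a string
class(df[[1, "url"]])  # "character" - a plain string, safe to pass to read_html()
df[1, ]$url            # the same plain string as df[[1, "url"]]
```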

Either should fix your problem. If you want to read more about these notations, you could look at this answer.
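Putting it together, here is one sketch of a corrected loop. It assumes the file name, the url column, and the .single__content selector from your question; the tryCatch() guard and the Sys.sleep() pause are my additions, so that a single broken link (like the 404 you hit in your second attempt) doesn't abort the whole run:

```r
library(readr)
library(rvest)

news_urls <- read_csv("nyp_data.csv")

content_list <- character(nrow(news_urls))
for (i in seq_len(nrow(news_urls))) {
  nyp_url <- news_urls[[i, "url"]]      # double brackets: a plain string
  nyp_texts <- tryCatch(
    read_html(nyp_url) %>%
      html_elements(".single__content") %>%
      html_text(),
    error = function(e) NA_character_   # e.g. a 404 just leaves an NA
  )
  content_list[i] <- nyp_texts[1]
  Sys.sleep(1)                          # be polite to the server
}
```

Afterwards content_list[i] holds the scraped text for row i, or NA for pages that failed to load.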
