I am trying to scrape multiple RSS links in R (800 news articles). I was able to scrape individual URLs with:
library(rvest)

cnn_url <- "http://rss.cnn.com/~r/rss/cnn_travel/~3/-GFuCIsYZgQ/index.html"
cnn_html <- read_html(cnn_url)
cnn_html

cnn_nodes <- cnn_html %>% html_elements(".Article__body")

# extract the article text
cnn_texts <- cnn_html %>%
  html_elements(".Article__body") %>%
  html_text()
cnn_texts[1]
But I need the main text of more than 800 news stories, and I can't run the code above by hand for each URL. So I used this code:
cnn_data <- news_data %>%
  filter(media_name == "CNN")
head(cnn_data$url)
head(cnn_data)

urls <- cnn_data[, 4]  # column with the URLs
url_xml <- try(apply(urls, 1, read_html))

textScraper <- function(x) {
  html_text(html_nodes(x, ".Article__body") %>%
              html_nodes("p")) %>%
    paste(collapse = '')
}

cnn_text <- lapply(url_xml, textScraper)
cnn_text[1]
cnn_data$full_article <- cnn_text
head(cnn_data$full_article)
But when I ran the line:
url_xml <- try(apply(urls, 1, read_html))
I got an error message that says: Error in open.connection(x, "rb") : HTTP error 404.
I assume this may be because the URLs come from an RSS feed. Is there any way to scrape these news stories using the URLs I have?
FYI: the data file consists of rows with links like these:
http://rss.cnn.com/~r/rss/cnn_travel/~3/-GFuCIsYZgQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/WvpC9ZKjJXo/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/przZf_johNY/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/TieFj4roU_M/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/iqRZ7f8MhzQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/Uq46bJROhiI/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/6u-D9sna6uY/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/JNTXgcM1yY0/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/WG8UTHcZvwQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/6YHMwdj6W7s/index.html
CodePudding user response:
Your try() is wrapped around the call to apply(), so if even one link is bad the whole apply() errors out and only then does try() take over. You need to wrap try() around read_html() for each individual URL, not around the apply().

Something like this should work, returning a list of web pages. Note that all of the links you listed above work.
library(rvest)

mylist <- lapply(urls, function(url) {
  # be kind and don't hammer the server
  Sys.sleep(1)
  print(url)  # progress / debugging
  try(read_html(url))
})
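One possible follow-up (not part of the original answer, and assuming urls is a plain character vector such as cnn_data$url rather than a one-column data frame): failed downloads come back as "try-error" objects, so you can drop them before reusing the .Article__body selector from your single-page code to pull the text.

# drop entries where read_html() failed (those are "try-error" objects)
ok <- !vapply(mylist, inherits, logical(1), what = "try-error")

# reuse the ".Article__body" selector from the question; one string per article
cnn_text <- vapply(mylist[ok], function(page) {
  page %>%
    html_elements(".Article__body p") %>%
    html_text() %>%
    paste(collapse = " ")
}, character(1))

# attach the scraped text back to the data, NA where the download failed
cnn_data$full_article <- NA_character_
cnn_data$full_article[ok] <- cnn_text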
Yes, it is possible to write code that handles different page layouts, but that is potentially a much bigger question to answer; a rough sketch of one approach is below.
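As a rough, unverified sketch of that idea: try a few candidate selectors per page and keep the first one that matches anything. The selectors other than .Article__body below are guesses, not selectors confirmed against CNN's markup.

# sketch only: selectors other than ".Article__body p" are assumptions
extract_text <- function(page,
                         selectors = c(".Article__body p",
                                       ".zn-body__paragraph",
                                       "article p")) {
  for (sel in selectors) {
    txt <- page %>% html_elements(sel) %>% html_text()
    if (length(txt) > 0) {
      return(paste(txt, collapse = " "))
    }
  }
  NA_character_  # no selector matched this page
}

cnn_text <- vapply(mylist[ok], extract_text, character(1))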