How to create a "for loop" in R which can web scrape data from each URL from a list of URL


I have a dataset on a GitHub page. I imported it into RStudio as a CSV file and created an array of URLs called "StoryLink". Now I want to web scrape data from each of these web pages, so I created a for loop, assigned all of the collected data to a variable called "articleText", and converted it to a character array called "ArticlePage".

My problem is that even though I created a for loop, it only scrapes the last web page (the 6th article) in the list of URLs. How do I web scrape all the URLs?

library(rvest)
library(dplyr)

GitHubpoliticsconversions<-  "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"

CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")

StoryLink <- c(pull(CSVFile, 4))

page <- {}

for (i in 1:6) {
  page[i] <- c(StoryLink[i])

  ArticlePage <- read_html(page[i])

  articleText = ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  PoliticalArticles <- c(articleText)
}

This is the result I got from this code, but I need the same from all the web pages:

>PoliticalArticles
[1] "Wie es zur Hausdurchsuchung bei Finanzminister Blümel kam"                                                                                                                                 
[2] "Die Novomatic hatte den heutigen Finanzminister 2017 um Hilfe bei Problemen im Ausland gebeten – und eine Spende für die ÖVP angeboten. Eine solche habe er nicht angenommen, sagt Blümel."
>

CodePudding user response:

You need to store your retrieved website data in a data format that can grow progressively, e.g. a list. In your loop, PoliticalArticles <- c(articleText) is overwritten on every iteration, which is why only the last article's text is left at the end.

You can assign elements to a (previously created) list in a for loop by using i as the list index. In the example below we simply store the result of each 2*i calculation in data_list. Results can then be retrieved by accessing the corresponding list element, e.g. data_list[[1]].

data_list <- list()  # empty list that grows with each iteration

for (i in 1:10) {
  data_list[[i]] <- 2 * i  # store each result at position i
}

data_list        # all ten results
data_list[[1]]   # retrieve the first result
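Unlike a plain vector, each list element can itself hold a whole vector of values, which matters here because every article may return a different number of text nodes.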

In your example, you can do exactly the same. N.B. I have slightly altered and simplified your code: I iterate through your list of websites directly, so i is each web URL. Results are then stored, as outlined above, in a list that progressively grows in size and can be accessed by position, e.g. pages[[1]], or by the respective URL, e.g. pages[["https://www.diepresse.com/5958204"]].

library(rvest)
library(dplyr)

GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"

CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")

# The fourth column of the CSV holds the article URLs
StoryLink <- pull(CSVFile, 4)

pages <- list()

for (i in StoryLink) {

  ArticlePage <- read_html(i)  # download and parse the article page

  articleText <- ArticlePage %>%
    html_elements(".lead , .article__title") %>%
    html_text()

  pages[[i]] <- articleText  # store the scraped text under its URL
}
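
Once the loop has finished, each element of pages holds the scraped text for one article, keyed by its URL. As a minimal sketch, assuming each article yields exactly two strings (the title first, the lead second, matching the output shown in the question), you could collect everything into a data frame; the names results, url, title and lead below are my own choices, not part of the original code:

pages[[1]]                                    # first article, by position
pages[["https://www.diepresse.com/5958204"]]  # same element, by URL

# Combine all articles into one data frame (assumes two strings per article)
results <- data.frame(
  url   = names(pages),
  title = sapply(pages, `[`, 1),  # first string of each element
  lead  = sapply(pages, `[`, 2),  # second string of each element
  row.names = NULL
)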