I have a dataset on a GitHub page. I imported it into RStudio as a CSV file and created an array of URLs called "StoryLink". Now I want to web scrape data from each of these web pages, so I created a for loop, assigned all of the collected data to a variable called "articleText", and converted it to a character array called "ArticlePage".
My problem is that even though I created a for loop, it only scrapes the last web page (the 6th article) in the list of URLs. How do I scrape all the URLs?
library(rvest)
library(dplyr)
GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"
CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")
StoryLink <- c(pull(CSVFile, 4))
page <- {}
for(i in 1:6){
  page[i] <- c(StoryLink[i])
  ArticlePage <- read_html(page[i])
  articleText = ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  PoliticalArticles <- c(articleText)
}
This is the result I got from this code, but I need the same from all the web pages:
>PoliticalArticles
[1] "Wie es zur Hausdurchsuchung bei Finanzminister Blümel kam"
[2] "Die Novomatic hatte den heutigen Finanzminister 2017 um Hilfe bei Problemen im Ausland gebeten – und eine Spende für die ÖVP angeboten. Eine solche habe er nicht angenommen, sagt Blümel."
>
CodePudding user response:
You need to store your retrieved website data in a data format that can grow progressively, e.g. a list.
You can assign elements to a (previously created) list inside a for loop by using i as the list index. In the example below we simply store the result of each 2*i calculation in data_list. Results can then be retrieved by accessing the list element, e.g. data_list[1].
data_list <- list()     # empty list that grows with each assignment
for (i in 1:10) {
  data_list[i] <- 2*i   # store each result at position i
}
data_list               # the whole list
data_list[1]            # just the first element
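A quick note on list indexing, since the loop further down uses the double-bracket form: single brackets return a one-element list, while double brackets return the stored value itself. For example, with a throwaway list x:
x <- list(10, 20, 30)
x[1]    # a one-element list containing 10
x[[1]]  # the value 10 itself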
In your example, you can do exactly the same. N.B. I have slightly altered and simplified your code: I iterate through your website list directly, so i is each web URL. Results are then stored, as outlined above, in a list that progressively grows in size and can be accessed by position, e.g. pages[1], or by the respective URL, e.g. pages["https://www.diepresse.com/5958204"].
library(rvest)
library(dplyr)
GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"
CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")
StoryLink <- c(pull(CSVFile, 4))
pages <- list()
for(i in StoryLink){
  ArticlePage <- read_html(i)   # parse each article page
  articleText <- ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  pages[[i]] <- articleText     # store the result under its URL
}
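If you then want all articles in a single table, here is a minimal sketch (not part of the question, and assuming each element of pages is a character vector holding the title and the lead text) that collapses each article into one row, using the dplyr package already loaded above:
# Sketch: combine the named list into one data frame, one row per article
articles_df <- bind_rows(lapply(names(pages), function(url) {
  data.frame(
    url  = url,
    text = paste(pages[[url]], collapse = " "),  # title and lead joined into one string
    stringsAsFactors = FALSE
  )
}))
articles_df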