My code looks like this:
### Webpages
# The first webpage
url <- "https://www.finanzen.net/nachrichten/rubrik/ad-hoc-meldungen"
### Function to extract the links of the data for the MAIN PAGE
library(magrittr) # for the %>% pipe (also loaded by dplyr/tidyverse)
### Function to extract the links of the data for the MAIN PAGE
scraplinks <- function(url){
  # Create an html document from the url
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_text()
  return(tibble::tibble(link = link_, url = url_))
}
urls <- scraplinks(url)
head(urls) # So this works
The thing is, there is more than one page. See the next code:
url <- "https://www.finanzen.net/nachrichten/rubrik/ad-hoc-meldungen@intpagenr_3"
So just adding "@intpagenr_3" takes you to the third page, for example.
I want to extract 10 pages from the homepage with the function above.
My attempt was:
more_than_one_page <- function(url, number_of_pages) {
  output1 <- scraplinks(url)
  for (i in 1:number_of_pages){
    output2 <- data.frame()
    new_input <- scraplinks(paste0(url, "@intpagenr_", i))
    output2[nrow(new_input), ] <- new_input # Adding one row, i.e. nrow(new_input) + 1
  }
  output <- rbind(output1, output2)
}
data1 <- more_than_one_page(url, 15)
But I don't know how to add a new row, because I don't know the exact number of rows to initialize.
Does anybody have a suggestion? If something is unclear, please ask. Thank you.
I tried a for loop where the index is the page number of the webpage, but I don't know how to initialize the data frame with the exact number of rows.
CodePudding user response:
With base R, you could:
- create the URL list:
urls <- paste0('https://www.finanzen.net/nachrichten/rubrik/',
               'ad-hoc-meldungen@intpagenr_',
               1:10)
- Map the urls to the results of your function scraplinks, and Reduce these singular results into one row-bound block with rbind:
all_data <-
  urls[1:3] |> ## * see footnote
  Map(f = scraplinks) |>
  Reduce(f = rbind)
* I only did this for pages 1-3; take care to comply with the service's policies regarding harvesting.
The map (and reduce) strategy is often helpful when working with R structures, particularly for avoiding explicit loops. There's a dedicated package, {purrr}, to support this.
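For illustration, the Map/Reduce pipeline above could be collapsed into a single call with {purrr}. This is a sketch: the scraplinks() below is a stand-in that runs offline, so you would substitute the real function from the question for actual scraping, and the page range is only an example.

```r
library(purrr)

# Stand-in for the question's scraplinks() so the sketch runs offline;
# swap in the real rvest-based function for actual scraping.
scraplinks <- function(url) data.frame(link = paste("links on", url), url = url)

urls <- paste0("https://www.finanzen.net/nachrichten/rubrik/",
               "ad-hoc-meldungen@intpagenr_", 1:3)

# map_dfr() applies the function to each URL and row-binds the resulting
# data frames in one step, replacing the Map() + Reduce(rbind) pair
# (map_dfr() requires the dplyr package to be installed)
all_data <- map_dfr(urls, scraplinks)
```

With the real scraplinks(), all_data would hold the link text and href of every anchor across the requested pages, one stacked tibble.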