Web Scraping multiple pages and combining the result in a dataframe by a for loop


My code looks like this:

### Webpages 

# The first webpage 

url <- "https://www.finanzen.net/nachrichten/rubrik/ad-hoc-meldungen" 

### Function to extract the links of the data for the MAIN PAGE
library(magrittr) # for the %>% pipe
library(tibble)   # for tibble()

scraplinks <- function(url){
  # Create an html document from the url
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_text()
  return(tibble(link = link_, url = url_))
}

urls <- scraplinks(url)
head(urls) # So this works 

The thing is, there is more than one page. See the next code:

url <- "https://www.finanzen.net/nachrichten/rubrik/ad-hoc-meldungen@intpagenr_3"

So just appending "@intpagenr_3", for example, takes you to the third page.

I want to extract 10 pages from the homepage with the function above.

My try was:

more_than_one_page <- function(url, number_of_pages) {
  output1 <- scraplinks(url)
  for (i in 1:number_of_pages) {
    output2 <- data.frame()
    new_input <- scraplinks(paste0(url, "@intpagenr_", i))
    output2[nrow(new_input), ] <- new_input # Adding one line is nrow(new_input) + 1
  }
  output <- rbind(output1, output2)
}
data1 <- more_than_one_page(url, 15)

But I don't know how to add a new row, because I don't know the exact number of rows to initialize the data frame with.

Does anybody have an idea? If something is unclear, please ask. Thank you.

I tried a for loop where the index is the page number of the webpage, but I don't know how to initialize the data frame with the exact number of rows.

CodePudding user response:

With base R, you could:

  • create a URL list:
    urls <- paste0('https://www.finanzen.net/nachrichten/rubrik/',
                   'ad-hoc-meldungen@intpagenr_',
                   1:10
                   )
  • Map your function scraplinks over the URLs and Reduce the resulting list of tibbles into one row-bound data frame with rbind:
    all_data <-
        urls[1:3] |> ## * see footnote
        Map(f = scraplinks) |>
        Reduce(f = rbind)

* I only did this for pages 1-3; take care to comply with the service's policies re. harvesting.
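As an aside, if you would rather keep an explicit for loop, the usual fix for the initialization problem is not to grow the data frame row by row at all: collect each page's tibble in a pre-allocated list and row-bind once at the end. A minimal sketch (assuming, as in the URL list above, that the "@intpagenr_1" suffix returns the first page):

more_than_one_page <- function(url, number_of_pages) {
  # Pre-allocate a list of pages instead of guessing a row count
  pages <- vector("list", number_of_pages)
  for (i in seq_len(number_of_pages)) {
    pages[[i]] <- scraplinks(paste0(url, "@intpagenr_", i))
  }
  # Row-bind all pages into one data frame in a single step
  do.call(rbind, pages)
}

data1 <- more_than_one_page(url, 10)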

The map (and reduce) strategy is often helpful when working with R data structures, particularly to avoid explicit loops. There's a dedicated package, {purrr}, to support this.
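For instance, a minimal {purrr} sketch of the same pipeline (this assumes purrr >= 1.0.0, which provides list_rbind()):

library(purrr)

all_data <-
  urls[1:3] |>        # again only pages 1-3
  map(scraplinks) |>  # one tibble per page
  list_rbind()        # bind them into a single tibble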
