R: Scraping multiple pages using rvest

With this code I scrape the single table from the URL:

library(rvest)
library(dplyr)

WA_link<-"https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=1"
WA_page<-read_html(WA_link)

WA_table <- WA_page %>% html_nodes("table.records-table") %>%
  html_table() %>% .[[1]]

I want to combine the tables from all pages (or from a given number of pages) into one data frame and then remove duplicate rows (headings). I know I need to build a loop, but I am not experienced with that. Could somebody give me a hand? Thank you.

CodePudding user response：

Here is a loop to scrape and combine pages 1-4. I am not sure how to find out from the website how many pages are in the set, so for now the number of pages has to be set manually.

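library(rvest)
library(dplyr)

pages <- 1:4 # replace 4 with the actual number of pages

WA_list <- vector("list", length(pages)) # one slot per page
for (i in seq_along(pages)) {
  # build the URL for this page and download it
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", pages[i])
  WA_page <- read_html(WA_link)

  # keep the first table that matches the selector
  WA_list[[i]] <- WA_page %>% html_nodes("table.records-table") %>%
    html_table() %>% .[[1]]
}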
WA_table <- dplyr::bind_rows(WA_list)
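
The question also asks about removing duplicate rows after combining. A minimal sketch of one way to do that, assuming the duplicates are rows that match exactly in every column, is dplyr::distinct(). The two small data frames below are made-up stand-ins for scraped pages, not real results, and their column names are hypothetical.

```r
library(dplyr)

# Made-up stand-ins for two scraped pages (hypothetical columns and values)
page1 <- data.frame(Rank = c(1, 2), Competitor = c("A", "B"))
page2 <- data.frame(Rank = c(2, 3), Competitor = c("B", "C"))

# Stack the pages, then drop rows that are identical in every column
combined <- bind_rows(list(page1, page2))
deduped <- distinct(combined)
```

With the scraped data, the same call would be distinct(WA_table). Rows that differ in any column are all kept, so this only helps with exact repeats.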

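Alternatively, you can request many more pages than you expect to exist. As written, the loop is expected to stop with an error at the first page that has no table, and WA_table keeps everything that was combined up to that point.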

pages <- 1:100

# pre-allocate an empty list with one slot per page
WA_list <- vector("list", length(pages))
for (i in seq_along(pages)) {
  print(i) # show progress
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", pages[i])
  WA_page <- read_html(WA_link)

  WA_list[[i]] <- WA_page %>% html_nodes("table.records-table") %>%
    html_table() %>% .[[1]]
  # Crude solution: rebuild the data frame on every iteration so that it
  # survives when the loop stops after the last page. Ideally there would
  # be a check here for when no data is retrieved on pages[i].
  WA_table <- dplyr::bind_rows(WA_list)
}

Note: Hopefully someone can edit this answer so that the loop exits at the first i where WA_page %>% html_nodes("table.records-table") %>% html_table() returns no table. bind_rows() could then move to after the loop, which avoids rebuilding the data frame on every iteration.
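
One possible way to get that early exit is sketched below. It is untested against the live site and rests on an assumption: that a page past the last one either has no table.records-table element or has one with zero rows. If the site answers with an HTTP error instead, read_html() would still stop with an error. The names base_url, get_page_table and scrape_until_empty are hypothetical helpers introduced here, not part of rvest or dplyr.

```r
library(rvest)
library(dplyr)

base_url <- "https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page="

# Hypothetical helper: return the first records table on page `n`,
# or NULL when the page has no such table (or the table has no rows).
get_page_table <- function(n) {
  tables <- read_html(paste0(base_url, n)) %>%
    html_nodes("table.records-table") %>%
    html_table()
  if (length(tables) == 0 || nrow(tables[[1]]) == 0) {
    return(NULL)
  }
  tables[[1]]
}

# Hypothetical helper: call fetch(1), fetch(2), ... and stop at the first
# NULL result or after max_pages pages, whichever comes first.
scrape_until_empty <- function(fetch, max_pages = 100) {
  results <- list()
  for (n in seq_len(max_pages)) {
    tbl <- fetch(n)
    if (is.null(tbl)) break
    results[[length(results) + 1]] <- tbl
  }
  bind_rows(results) # combine once, after the loop
}

# Usage (makes network requests, so it is left commented out):
# WA_table <- scrape_until_empty(get_page_table)
```

Passing the page fetcher in as an argument keeps the stopping logic separate from the scraping, so the loop can be checked with a fake fetcher before it is pointed at the real site. The max_pages cap is a safety net in case the site never returns an empty page.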