With this code I scrape the single table from the URL:
library(rvest)
library(dplyr)

WA_link <- "https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=1"
WA_page <- read_html(WA_link)
WA_table <- WA_page %>%
  html_nodes("table.records-table") %>%
  html_table() %>%
  .[[1]]
I want to have all the tables in the same data frame (or a given number of them) and then remove the duplicate rows (the repeated headings). I know I need to build a loop, but I am not skilled at that. Could somebody give me a hand? Thank you.
CodePudding user response:
Here is a loop to scrape and combine pages 1-4. I am not sure how to scrape the website to find out how many pages are in the set, so for now the number of pages has to be set manually.
pages <- 1:4  # set 4 to however many pages there are
WA_list <- list()
for (i in seq_along(pages)) {
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", pages[i])
  WA_page <- read_html(WA_link)
  WA_list[[i]] <- WA_page %>%
    html_nodes("table.records-table") %>%
    html_table() %>%
    .[[1]]
}
WA_table <- dplyr::bind_rows(WA_list)
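To drop the repeated header rows the question mentions, a small cleanup step can follow the bind. A sketch using a toy data frame in place of the real scraped result; "Rank" and "Mark" are assumed column names here (check names(WA_table) for the actual headers on the live page):

```r
library(dplyr)

# Toy stand-in for the bound result: a header row parsed as data
# ("Rank"/"Mark" appearing as values) plus one fully duplicated row.
WA_table <- data.frame(
  Rank = c("1", "2", "Rank", "3", "3"),
  Mark = c("10.54", "10.60", "Mark", "10.61", "10.61"),
  stringsAsFactors = FALSE
)

WA_clean <- WA_table %>%
  filter(Rank != "Rank") %>%  # drop header rows parsed as data (assumed column name)
  distinct()                  # drop fully duplicated data rows
```

distinct() keeps the first occurrence of each unique row, so the order of the remaining rows is preserved.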
Alternatively, you can scan many more pages than you expect to exist.
pages <- 1:100
WA_list <- vector("list", length(pages))  # pre-allocate an empty list of length 100
for (i in seq_along(pages)) {
  print(i)
  WA_link <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", pages[i])
  WA_page <- read_html(WA_link)
  WA_list[[i]] <- WA_page %>%
    html_nodes("table.records-table") %>%
    html_table() %>%
    .[[1]]
  # Crude: binding inside the loop means WA_table is already built when the
  # loop errors out past the last real page. Ideally there would be a check
  # for when no data is retrieved on pages[i].
  WA_table <- dplyr::bind_rows(WA_list)
}
Note: Hopefully someone can edit this answer to exit the loop at the i where WA_page %>% html_nodes("table.records-table") %>% html_table() %>% .[[1]] returns no data. That solution would move the bind_rows() to after the loop and avoid redundantly re-binding the list on every iteration.
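One way to do that, sketched here under stated assumptions: wrap the per-page fetch in a helper that returns NULL when no records table is found on the page, break out of the loop on NULL, and bind once after the loop. fetch_table() and scrape_all() are names invented for this sketch, not part of rvest.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: returns the page's records table as a data frame,
# or NULL when the page has no "table.records-table" node.
fetch_table <- function(page_no) {
  url <- paste0("https://www.worldathletics.org/records/toplists/sprints/100-metres/outdoor/women/senior/2021?page=", page_no)
  tbls <- read_html(url) %>%
    html_nodes("table.records-table") %>%
    html_table()
  if (length(tbls) == 0) NULL else tbls[[1]]
}

# Loop until a page yields no table, then bind once at the end.
scrape_all <- function(fetch, max_pages = 100) {
  out <- list()
  for (i in seq_len(max_pages)) {
    tbl <- fetch(i)
    if (is.null(tbl) || nrow(tbl) == 0) break  # past the last page: stop
    out[[i]] <- tbl
  }
  bind_rows(out)  # single bind, after the loop
}

# WA_table <- scrape_all(fetch_table)
```

Taking the fetch function as an argument keeps the stopping logic separate from the network call, so the loop can be checked without hitting the site.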