How could I crawl this database with rvest to identify all tournament IDs for each year? Currently, I'm just going from 1:maxx(event_id), which is really a drain on compute time.
https://www.worldloppet.com/results/
The results filter seems to be dynamic on the webpage, so the url doesn't change.
outlist <- list()
for (event_id in 2483:2570) {
event_id = 2483
# update progress
message('Retrieving Event ',event_id)
race_url = paste0('https://www.worldloppet.com/browse/?id=',event_id)
event_info = read_html(race_url) %>%
html_nodes('h2') %>%
.[1] %>%
gsub('<br>','<br> ',.) %>%
gsub("<[^>] >", "",.) %>%
str_split(.,' ') %>%
unlist()
#event_info$eventid <- event_id
outlist <- c(outlist, list(c(event_id, event_info)))
# temporary break
Sys.sleep(3)
}
CodePudding user response:
You can extract all links containing the word browse from the HTML document:
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
read_html("https://www.worldloppet.com/results/") %>%
html_nodes("a") %>%
html_attr("href") %>%
as.character() %>%
keep(~ .x %>% str_detect("browse")) %>%
paste0("https://www.worldloppet.com",.)
#> [1] "https://www.worldloppet.com/browse/?id=2570"
#> [2] "https://www.worldloppet.com/browse/?id=1818"
#> [3] "https://www.worldloppet.com/browse/?id=1817"
#> [4] "https://www.worldloppet.com/browse/?id=2518"
#> [5] "https://www.worldloppet.com/browse/?id=2517"
Created on 2022-02-09 by the reprex package (v2.0.1)
CodePudding user response:
The IDs of the rage can be found in the links, which can be extracted using the html_attr
function. From there we can use some regex to find the numbers, here I include id=
to make sure the page is an id, as I'm not sure whether you want to include links like masters=9173
.
library(rvest)
library(stringi)
url <- "https://www.worldloppet.com/results/"
page <- read_html(url)
string <- html_attr(html_elements(page, "a"), "href")
matches <- stri_extract_all_regex(string, "(?<=id=).*", simplify = T)
as.integer(matches[!is.na(matches)])
# first 5
[1] 2570 1818 1817 2518 2517