How to find all Event IDs in an efficient way?

How could I crawl this database with rvest to identify all tournament IDs for each year? Currently, I'm just iterating over 1:max(event_id), which wastes a lot of compute time because most of those IDs may not be relevant.

https://www.worldloppet.com/results/

The results filter on the webpage seems to be dynamic, so the URL doesn't change when a year is selected.

library(rvest)    # read_html(), html_nodes(), %>%
library(stringr)  # str_split()

outlist <- list()

for (event_id in 2483:2570) {
  # update progress
  message('Retrieving Event ',event_id)
  
  race_url = paste0('https://www.worldloppet.com/browse/?id=',event_id)
  
  event_info = read_html(race_url) %>% 
    html_nodes('h2') %>%
    .[1] %>%
    gsub('<br>','<br>  ',.) %>%
    gsub("<[^>]+>", "",.) %>%
    str_split(.,'  ') %>%
    unlist()
  
  #event_info$eventid <- event_id
  
  outlist <- c(outlist, list(c(event_id, event_info)))
  
  # temporary break 
  Sys.sleep(3)
  
}

CodePudding user response：

You can extract all links containing the word browse from the HTML document:

library(tidyverse)
library(rvest)
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

read_html("https://www.worldloppet.com/results/") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  as.character() %>%
  keep(~ .x %>% str_detect("browse")) %>%
  paste0("https://www.worldloppet.com",.)
#>  [1] "https://www.worldloppet.com/browse/?id=2570"
#>  [2] "https://www.worldloppet.com/browse/?id=1818"
#>  [3] "https://www.worldloppet.com/browse/?id=1817"
#>  [4] "https://www.worldloppet.com/browse/?id=2518"
#>  [5] "https://www.worldloppet.com/browse/?id=2517"

Created on 2022-02-09 by the reprex package (v2.0.1)
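
To check the link filter without hitting the site, the same steps can be run on a small inline HTML fragment. This is only a sketch: the fragment below is made up for illustration (the link texts and the /about/ link are assumptions), and the real page may contain other kinds of links.

```r
library(rvest)

# Made-up fragment standing in for the results page (illustrative only)
page <- minimal_html('
  <a href="/browse/?id=2570">Race A</a>
  <a href="/results/?masters=9173">Masters</a>
  <a href="/about/">About</a>
  <a href="/browse/?id=1818">Race B</a>
')

# Collect every href, then keep only those pointing at a browse page
hrefs <- page %>% html_elements("a") %>% html_attr("href")
event_links <- hrefs[grepl("browse", hrefs, fixed = TRUE)]

# Turn the relative links into absolute URLs
event_urls <- paste0("https://www.worldloppet.com", event_links)
event_urls
```

Only the two browse links survive the filter; the masters and about links are dropped.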

CodePudding user response：

The IDs of the races can be found in the links, which can be extracted using the html_attr function. From there we can use a regex to pull out the numbers. Here I include id= in the pattern so that only event links match, as I'm not sure whether you want to include links like masters=9173.

library(rvest)
library(stringi)
url <- "https://www.worldloppet.com/results/"
page <- read_html(url)
string <- html_attr(html_elements(page, "a"), "href")

# capture only the digits that follow "id="
matches <- stri_extract_all_regex(string, "(?<=id=)\\d+", simplify = TRUE)
as.integer(matches[!is.na(matches)])

# first 5
[1] 2570 1818 1817 2518 2517
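
Once the IDs are known, the loop from the question only needs to visit those pages instead of every number in a range. Below is a minimal base R sketch of that idea. The helper name extract_event_ids and the sample hrefs are hypothetical, and it assumes the event links have the form /browse/?id=<digits> as in the output above.

```r
# Hypothetical helper: pull the numeric event IDs out of a vector of hrefs.
extract_event_ids <- function(hrefs) {
  hrefs <- hrefs[!is.na(hrefs)]
  # regmatches() returns only the matching parts, so non-event links drop out
  hits <- regmatches(hrefs, regexpr("(?<=browse/\\?id=)\\d+", hrefs, perl = TRUE))
  # remove duplicates (the page may link to an event more than once) and sort
  sort(unique(as.integer(hits)))
}

# Made-up sample input, standing in for html_attr(html_elements(page, "a"), "href")
sample_hrefs <- c("/browse/?id=2570", "/results/?masters=9173", NA,
                  "/browse/?id=1818", "/browse/?id=2570")

event_ids <- extract_event_ids(sample_hrefs)
event_ids
```

With the real hrefs passed in, the question's loop header can become for (event_id in event_ids) so that only existing event pages are requested, with the Sys.sleep() pause kept between requests.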