How to scrape links off a website in R? Unable to scrape links

Time:09-15

I'm trying to scrape the links from this website:

library(rvest)
library(tidyverse)
url=read_html('https://web.archive.org/web/*/https://www.bjjcompsystem.com/tournaments/1869/categories*')

get_links <- url %>% 
  html_nodes('#resultsUrl a') %>% 
  html_attr('href') %>%
  paste0('https://web.archive.org/web/20220000000000*/', .)
get_links

But all I get is character(0). I even tried looking for the li class as has been suggested to me before, but there do not appear to be any.

There's clearly something I'm not understanding when it comes to scraping the links. I know I'm on the right track since I've done it before, but there's a detail missing somewhere. Can someone explain what I'm doing wrong and how to fix it?

CodePudding user response:

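The list of links on that page appears to be filled in by JavaScript after the page loads, so read_html() only sees the empty page shell and the selector matches nothing. Get the links from their source instead, the JSON endpoint that the page itself queries: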

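library(tidyverse)  # %>% and as_tibble()
library(httr2)      # request(), req_perform(), resp_body_json()
library(janitor)    # row_to_names()

# Wayback Machine timemap endpoint, queried for every capture under the URL prefix
"https://web.archive.org/web/timemap/json?url=https://www.bjjcompsystem.com/tournaments/1869/categories&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&filter=!statuscode:[45]..&limit=10000&_=1663136483842" %>% 
  request() %>% 
  req_perform() %>% 
  # The response is a JSON array of rows; simplifyVector = TRUE collapses it into a matrix
  resp_body_json(simplifyVector = TRUE) %>% 
  as_tibble() %>% 
  # The first row holds the column names
  row_to_names(1)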

# A tibble: 784 × 6
   original                                           mimet…¹ times…² endti…³ group…⁴ uniqc…⁵
   <chr>                                              <chr>   <chr>   <chr>   <chr>   <chr>  
 1 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 3       3      
 2 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 6       6      
 3 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2      
 4 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
 5 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2      
 6 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
 7 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
 8 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2      
 9 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
10 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
# … with 774 more rows, and abbreviated variable names ¹​mimetype, ²​timestamp, ³​endtimestamp,
#   ⁴​groupcount, ⁵​uniqcount
# ℹ Use `print(n = ...)` to see more rows
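
The table above holds the original URLs and capture timestamps rather than archive links. As a minimal sketch, assuming the usual Wayback Machine pattern of https://web.archive.org/web/<timestamp>/<original>, the two columns can be pasted together into snapshot URLs. The helper name snapshot_url and the sample rows below are hypothetical placeholders, not values from the real result:

```r
# Build a Wayback Machine snapshot URL from a capture timestamp and the
# original URL, assuming the pattern web.archive.org/web/<timestamp>/<original>.
snapshot_url <- function(timestamp, original) {
  paste0("https://web.archive.org/web/", timestamp, "/", original)
}

# Hypothetical rows shaped like the tibble above; the values are placeholders.
links <- data.frame(
  original = c(
    "https://www.bjjcompsystem.com/tournaments/1869/categories/EXAMPLE1",
    "https://www.bjjcompsystem.com/tournaments/1869/categories/EXAMPLE2"
  ),
  timestamp = c("20220901000000", "20220902000000"),
  stringsAsFactors = FALSE
)

# paste0() is vectorised, so this produces one snapshot URL per row.
links$snapshot <- snapshot_url(links$timestamp, links$original)
links$snapshot
```

With the real tibble, the same helper could be applied inside mutate(), for example mutate(snapshot = snapshot_url(timestamp, original)).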