I'm trying to scrape the links off of this website:
library(rvest)
library(tidyverse)
url=read_html('https://web.archive.org/web/*/https://www.bjjcompsystem.com/tournaments/1869/categories*')
get_links <- url %>%
html_nodes('#resultsUrl a') %>%
html_attr('href') %>%
paste0('https://web.archive.org/web/20220000000000*/', .)
get_links
But all I get is character(0). I even tried looking for the li class, as has been suggested to me before, but there do not appear to be any.
There's clearly something I'm not understanding when it comes to scraping the links. I know I'm on the right track since I've done it before, but there's a detail missing somewhere. Can someone explain what I'm doing wrong and how to fix it?
CodePudding user response:
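Get the links from their source. The archive listing page fills in its results table with JavaScript after loading, so the HTML that read_html() receives contains no links for the selector to match. The page pulls its data from a JSON endpoint, which you can query directly: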
library(tidyverse)
library(httr2)
library(janitor)
# Timemap endpoint: lists the archived URLs under the given prefix as JSON
"https://web.archive.org/web/timemap/json?url=https://www.bjjcompsystem.com/tournaments/1869/categories&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&filter=!statuscode:[45]..&limit=10000&_=1663136483842" %>%
request() %>%
req_perform() %>%
# The body is an array of arrays, which simplifies to a character matrix
resp_body_json(simplifyVector = TRUE) %>%
as_tibble() %>%
# The first row holds the column names
row_to_names(1)
# A tibble: 784 × 6
   original                                           mimet…¹ times…² endti…³ group…⁴ uniqc…⁵
   <chr>                                              <chr>   <chr>   <chr>   <chr>   <chr>
 1 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 3       3
 2 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 6       6
 3 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2
 4 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1
 5 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2
 6 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1
 7 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1
 8 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2
 9 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1
10 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1
# … with 774 more rows, and abbreviated variable names ¹mimetype, ²timestamp, ³endtimestamp,
#   ⁴groupcount, ⁵uniqcount
# ℹ Use `print(n = ...)` to see more rows
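If you then want a link to each archived copy, one option is to combine the timestamp and original columns, since Wayback Machine snapshot URLs follow the pattern https://web.archive.org/web/<timestamp>/<original>. The sketch below is only an illustration: build_snapshot_urls() is a hypothetical helper, not part of any package, and the two example rows are made up so the code runs without a network call.

```r
# Hypothetical helper: build one Wayback Machine snapshot URL per row.
# Assumes `captures` has character columns named `original` and `timestamp`,
# as in the tibble returned above.
build_snapshot_urls <- function(captures) {
  stopifnot(all(c("original", "timestamp") %in% names(captures)))
  paste0("https://web.archive.org/web/", captures$timestamp, "/", captures$original)
}

# Made-up rows standing in for the real result (not actual archive data).
captures <- data.frame(
  original  = c("https://example.com/a", "https://example.com/b"),
  timestamp = c("20220914000000", "20220915123456"),
  stringsAsFactors = FALSE
)

snapshot_urls <- build_snapshot_urls(captures)
snapshot_urls
```

With the real tibble you would pass it in place of the made-up data frame, and each resulting URL could then be handed to read_html() from rvest as in the question.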