I'm using RSelenium to get the page source of an archive.org snapshot so I can scrape its links with rvest.
library(rvest)
library(tidyverse)
library(RSelenium)
library(netstat)

# Start a Selenium server and a Firefox session on a free port.
remote_driver <- rsDriver(browser = "firefox",
                          verbose = FALSE,
                          port = free_port())
# rsDriver() already opens a browser session, so no rd$open() is needed.
rd <- remote_driver$client

rd$navigate("https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories")
rd$maxWindowSize()

# Parse the rendered page source and collect the category links.
html <- read_html(rd$getPageSource()[[1]])
get_links <- html %>%
  html_nodes(".categories-grid__category a") %>%
  html_attr("href") %>%
  paste0("https://web.archive.org", .)
It successfully scrapes the links to the original website, but each one is missing the portion that belongs to archive.org.
This is what the first example returns:
https://web.archive.orghttp://www.bjjcompsystem.com/tournaments/1869/categories/2053146
But it's missing the unique identifier:
/web/20220913024354/
This is what the full link should look like: https://web.archive.org/web/20220913024354/https://www.bjjcompsystem.com/tournaments/1869/categories/2053146
How do I get the missing portion?
What the scraped links should look like:
etc.
CodePudding user response:
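I am not sure what you mean. Like this? Reading the snapshot directly with read_html(), without RSelenium, returns hrefs that already contain the /web/20220913022021/ prefix: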
library(tidyverse)
library(rvest)

"https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories" %>%
  read_html() %>%
  html_elements(".categories-grid__category a") %>%
  html_attr("href") %>%
  paste0("https://web.archive.org", .)
[1] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053146"
[2] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053150"
[3] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053154"
[4] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053158"
[5] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053162"
[6] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053166"
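If you need to stay with RSelenium, one option is to rebuild the prefix yourself. The sketch below is only an illustration in base R: `snapshot_timestamp()` and `to_archive_url()` are hypothetical helper names, not part of rvest or RSelenium, and it assumes every href is either an absolute URL to the original site or already starts with /web/<timestamp>/. It reuses the timestamp of the snapshot you navigated to (20220913022021), not the 20220913024354 from the question; as far as I know the Wayback Machine redirects a timestamp to the closest capture it has, so treat that as an assumption to verify.

```r
# Hypothetical helpers (not part of rvest or RSelenium).

# Pull the digits that follow "/web/" in a Wayback Machine snapshot URL.
snapshot_timestamp <- function(snapshot_url) {
  sub("^https?://web\\.archive\\.org/web/([0-9]+).*$", "\\1", snapshot_url)
}

# Build full archive links from hrefs that may or may not already carry
# the /web/<timestamp>/ prefix.
to_archive_url <- function(href, snapshot_url, host = "https://web.archive.org") {
  prefixed <- grepl("^/web/[0-9]+", href)
  ifelse(
    prefixed,
    paste0(host, href),  # href is already archive-relative
    paste0(host, "/web/", snapshot_timestamp(snapshot_url), "/", href)
  )
}

snapshot <- "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories"

# One href as returned through RSelenium, one as returned by read_html().
hrefs <- c(
  "http://www.bjjcompsystem.com/tournaments/1869/categories/2053146",
  "/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053150"
)

archive_links <- to_archive_url(hrefs, snapshot)
print(archive_links)
```

In the original RSelenium script, the final paste0() step could then be replaced with to_archive_url(snapshot), keeping the rest of the pipeline unchanged.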