I am trying to scrape multiple pages with rvest. However, the link I get through html_attr("href") is incomplete. The initial part of the link unfortunately changes across pages in a way that I am unable to understand. Do you know if there is a solution? Thank you.
These are two examples of pages on the website. The part that changes across pages seems to be "/sk####". (I am interested in the links to the Relazione and Testo articoli pages.)
http://leg14.camera.it/_dati/leg14/lavori/stampati/sk5000/frontesp/4543.htm
http://leg14.camera.it/_dati/leg14/lavori/stampati/sk4500/frontesp/4477.htm
df <- structure(list(date = c(20010618L, 20010618L, 20010618L, 20010618L,
20010618L), link = c("http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=814",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=858",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=875",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=802",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=816"
)), row.names = c(NA, 5L), class = "data.frame")
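library(rvest)    # read_html(), html_nodes(), html_attr(), %>%
library(pbapply)  # pbsapply()

# Collect every href from the nested table cells; NA if the request fails
df$linkfinal <- pbsapply(df$link, function(x) {
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('td td a') %>%
      html_attr("href") %>%
      toString()
  }, error = function(e) NA)
})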
Answer:
You simply need to capture the redirect URL (use httr), then swap out the string frontesp for either articola or relazion. If you first use these same substrings to test whether an href containing them is present on the page, you can use ifelse to either do the URL substitution described above or return NA.
There are faster ways of applying a function if working with large numbers of rows. I was just interested in this approach after reading about it here: https://blog.az.sg/posts/map-and-walk/.
library(tidyverse)
library(rvest) # read_html(), html_element(), html_attr(); not attached by library(tidyverse)
library(httr)
df <- structure(list(date = c(
20010618L, 20010618L, 20010618L, 20010618L,
20010618L
), link = c(
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=814",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=858",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=875",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=802",
"http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=816"
)), row.names = c(NA, 5L), class = "data.frame")
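# Return the redirect URL with "frontesp" swapped for url_sub_string,
# or NA if the page has no href containing url_sub_string.
get_link <- function(url, page, url_sub_string) {
  link <- page %>%
    html_element(sprintf("[href*=%s]", url_sub_string)) %>%
    html_attr("href")
  link <- ifelse(is.na(link), link, gsub("frontesp", url_sub_string, url))
  return(link)
}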
df <- df %>%
  pmap_dfr(function(...) {
    current <- tibble(...)   # the current row of df
    r <- GET(current$link)   # httr follows the redirect
    page <- r %>% read_html()
    redirect_link <- r$url   # final URL after the redirect
    current %>%
      mutate(
        articola = get_link(redirect_link, page, "articola"),
        relazion = get_link(redirect_link, page, "relazion")
      )
  })
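The substitution step on its own can be tried offline. The following is only a minimal sketch: it assumes the redirect lands on a URL shaped like the first example in the question, and swap_segment is a hypothetical helper name, not part of rvest or httr. It does not check that the resulting pages actually exist.

```r
# Assumed redirect URL, copied from the first example in the question.
redirect_url <- "http://leg14.camera.it/_dati/leg14/lavori/stampati/sk5000/frontesp/4543.htm"

# Hypothetical helper: swap the "frontesp" path segment for another page type.
# fixed = TRUE treats the pattern as a literal string rather than a regex.
swap_segment <- function(url, target) {
  gsub("frontesp", target, url, fixed = TRUE)
}

articola_url <- swap_segment(redirect_url, "articola")
relazion_url <- swap_segment(redirect_url, "relazion")

print(articola_url)
print(relazion_url)
```

Because the "/sk####" part comes along with the redirect URL, it never has to be guessed.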