I am trying to scrape the href from the 'Printer-Friendly Minutes' link on this website using SelectorGadget. This approach usually works, but this time I'm just getting an empty character vector in place of the href I'm trying to grab.
Here's the code:
library(rvest)
url <- "http://www.richmond.ca/cityhall/council/agendas/council/2021/012521_minutes.htm"
try <- url %>% read_html %>% html_nodes(".first-child a") %>% html_attr("href")
Anyone know what might be going wrong?
CodePudding user response:
As PFM is used as the abbreviation for the minutes, you can target the href by that substring:
library(rvest)
library(magrittr)
url <- "http://www.richmond.ca/cityhall/council/agendas/council/2021/012521_minutes.htm"
read_html(url) %>%
  html_element('[href*=PFM]') %>%
  html_attr('href')
You could also use its adjacent sibling relationship to the preceding img tag, which can be nicely targeted by its alt attribute value:
read_html(url) %>%
  html_element('[alt="PDF Document"] + a') %>%
  html_attr('href')
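To try both selectors without hitting the site, here is a minimal sketch against a small inline document. The markup, file names and variable names below are made up for illustration and only loosely follow the structure described above, so the real page may differ. Note that the adjacent sibling combinator is written with `+`; a plain space means "descendant", which can never match here because an img element has no children.

```r
library(rvest)

# Hypothetical markup for illustration only; not copied from the real page.
page <- minimal_html('
  <p>
    <img src="pdf.gif" alt="PDF Document">
    <a href="/__shared/assets/PFM_example.pdf" title="PFM_example">Printer-Friendly Minutes</a>
  </p>
  <p><a href="/other.htm">Other link</a></p>
')

# Attribute substring match: any element whose href contains "PFM".
by_substring <- page %>%
  html_element("[href*=PFM]") %>%
  html_attr("href")

# Adjacent sibling combinator: the <a> immediately following the matching <img>.
by_sibling <- page %>%
  html_element('[alt="PDF Document"] + a') %>%
  html_attr("href")

# Descendant combinator: looks for an <a> inside the <img>, so nothing matches
# and html_attr() returns NA for the missing node.
by_descendant <- page %>%
  html_element('[alt="PDF Document"] a') %>%
  html_attr("href")

print(by_substring)
print(by_sibling)
print(by_descendant)
```

With html_element() a failed match shows up as NA, whereas html_nodes() or html_elements() would give an empty character vector, which is what the question describes.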
CodePudding user response:
I think you have just not selected the node correctly. It's really helpful to learn XPath, which allows precise node navigation in HTML:
library(rvest)
domain <- "http://www.richmond.ca"
url <- paste0(domain, "/cityhall/council/agendas/council/2021/012521_minutes.htm")
pdf_url <- url %>%
  read_html %>%
  html_nodes(xpath = "//a[@title='PFM_CNCL_012521']") %>%
  html_attr("href") %>%
  paste0(domain, .)
pdf_url
#> [1] "http://www.richmond.ca/__shared/assets/PFM_CNCL_01252157630.pdf"
We can see this is a valid link by GETting the result:
httr::GET(pdf_url)
#> Response [https://www.richmond.ca/__shared/assets/PFM_CNCL_01252157630.pdf]
#> Date: 2021-10-18 20:35
#> Status: 200
#> Content-Type: application/pdf
#> Size: 694 kB
#> <BINARY BODY>
Created on 2021-10-18 by the reprex package (v2.0.0)
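As a possible alternative to pasting the domain on by hand, xml2::url_absolute() can resolve a relative href against the page URL. This is only a sketch: the href value below is inferred from the output shown above rather than scraped live, and the variable names are hypothetical.

```r
library(xml2)

# The page the link was found on.
base <- "http://www.richmond.ca/cityhall/council/agendas/council/2021/012521_minutes.htm"

# Assumed root-relative href, inferred from the pdf_url printed above.
href <- "/__shared/assets/PFM_CNCL_01252157630.pdf"

# Resolve the href against the page URL instead of concatenating strings.
abs_url <- url_absolute(href, base)
print(abs_url)
```

This should also cope with hrefs that are already absolute or that are relative to the page's directory, where paste0(domain, .) would produce a broken URL.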