I am attempting to scrape the World Health Organization website (https://www.who.int/publications/m) using the "WHO document type" dropdown for "Press Briefing transcript".
In the past I've been able to use the following script to download all specified file types to the working directory; however, I haven't been able to deal with the dropdown properly.
# Working example
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.github.com/rstudio/cheatsheets")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://www.github.com", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.github.com", .) %>% # prepend the website again to get a full url
walk2(., basename(.), download.file, mode = "wb") # use purrr to download the pdf associated with each url to the current working directory
If I start with the code below, what steps would I need to include to account for the "WHO document type" dropdown for "Press Briefing transcript" and download all files to the working directory?
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.who.int/publications/m")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://www.who.int", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.who.int", .) %>% # prepend the website again to get a full url
walk2(., basename(.), download.file, mode = "wb") # use purrr to download the pdf associated with each url to the current working directory
Currently, I get the following:
Error in .f(.x[[1L]], .y[[1L]], ...) : cannot open URL 'NA'
Desired result: PDFs downloaded to the working directory.
Answer:
There's not much rvest can do here: the document list is not included in the page source (which is all rvest can access) but is pulled in by JavaScript that the browser executes, and rvest can't run JavaScript. You can, however, make those same API calls yourself:
library(jsonlite)
library(dplyr)
library(purrr)
library(stringr)
# get list of reports, partial API documentation can be found
# at https://www.who.int/api/hubs/meetingreports/sfhelp
# additional parameters (i.e. select & filter) recovered from who.int web requests
# skip: number of articles to skip
get_reports <- function(skip = 0){
  read_json(URLencode(paste0("https://www.who.int/api/hubs/meetingreports?",
                             "$select=TrimmedTitle,PublicationDateAndTime,DownloadUrl,Tag&",
                             "$filter=meetingreporttypes/any(x:x eq f6c6ebea-eada-4dcb-bd5e-d107a357a59b)&",
                             "$orderby=PublicationDateAndTime desc&",
                             "$count=true&",
                             "$top=100&",
                             "$skip=", skip)),
            simplifyVector = TRUE) %>%
    pluck("value") %>%
    tibble()
}
# make 2 requests to collect all current (164) reports ("...&skip=0", "...&skip=100")
report_urls <- map_dfr(c(0,100), ~ get_reports(.x))
report_urls
#> # A tibble: 164 × 4
#> PublicationDateAndTime TrimmedTitle Downl…¹ Tag
#> <chr> <chr> <chr> <chr>
#> 1 2023-01-24T19:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 2 2023-01-11T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 3 2023-01-04T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 4 2022-12-21T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 5 2022-12-02T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 6 2022-11-16T16:00:00Z COVID-19, Monkeypox & Other Global Heal… https:… Pres…
#> 7 2022-11-10T22:00:00Z WHO press conference on global health i… https:… Pres…
#> 8 2022-10-19T21:00:00Z WHO press conference on global health i… https:… Pres…
#> 9 2022-10-19T21:00:00Z WHO press conference on global health i… https:… Pres…
#> 10 2022-10-12T21:00:00Z WHO press conference on COVID-19, monke… https:… Pres…
#> # … with 154 more rows, and abbreviated variable name ¹DownloadUrl
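If you don't want to hardcode the number of requests, here is a minimal paging sketch (my addition; it assumes the endpoint keeps returning at most 100 rows per request, matching $top=100 above, and simply keeps asking until a short or empty page comes back):
# minimal paging sketch (assumption: the API keeps honouring $top=100);
# request pages of 100 until a page with fewer than 100 rows is returned
get_all_reports <- function(page_size = 100) {
  pages <- list()
  skip <- 0
  repeat {
    page <- get_reports(skip)
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break  # last page reached
    skip <- skip + page_size
  }
  bind_rows(pages)
}
# report_urls <- get_all_reports()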
# get the first 3 transcripts; for destfile, split the url by "?", take the 1st part, and use basename to extract the file name from the url
walk(report_urls$DownloadUrl[1:3],
     ~ download.file(
         url = .x,
         destfile = basename(str_split_i(.x, "\\?", 1)),
         mode = "wb"))
# list downloaded files
list.files(pattern = "press.*pdf")
#> [1] "who-virtual-press-conference-on-global-health-issues-11-jan-2023.pdf"
#> [2] "virtual-press-conference-on-global-health-issues-24-january-2023.pdf"
#> [3] "virtual-press-conference-on-global-health-issues_4-january-2023.pdf"
Created on 2023-01-28 with reprex v2.0.2
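To grab every transcript rather than just the first three, the same pattern extends over the whole DownloadUrl column; a small sketch (the file-exists check and the pause are my additions, just to avoid re-downloading and to be gentle on the server):
# download all transcripts, skipping any file already on disk and
# pausing briefly between requests
walk(report_urls$DownloadUrl,
     function(url) {
       destfile <- basename(str_split_i(url, "\\?", 1))
       if (!file.exists(destfile)) {
         download.file(url = url, destfile = destfile, mode = "wb")
         Sys.sleep(1)
       }
     })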
That "working example" in question comes from https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199 , it is rather difficult to take and apply anything from that article unless you are already familiar with everything written there. To understand why applying scraping logic built for one site almost never works for another, maybe check https://rvest.tidyverse.org/articles/rvest.html & https://r4ds.hadley.nz/webscraping.html (both from rvest author).