R code for downloading all the pdfs given on a site: Web scraping


I want to write R code that downloads all the PDFs listed on this URL: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook of Statistics on Indian Economy and saves them into a folder. I tried the following code with the help of https://towardsdatascience.com, but it errors out:

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx? 
head=Handbook of Statistics on Indian Economy") %>%

raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>%  # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://rbi.org.in", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.rbi.org.in", .) %>% # prepend the website again to get a full url
for (url in raw_list)
{ download.file(url, destfile = basename(url), mode = "wb") 
} 

I am not able to figure out why the code is erroring out. Could someone please help?

CodePudding user response:

There were a few small mistakes: the website uses capital letters for the PDF file endings, and you don't need the str_c("https://rbi.org.in", .) step. Finally, I think using purrr's walk2 function is smoother (as it probably was in the original code).

I haven't executed the code, since I don't need that many PDFs, so please report back whether it works.

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook of Statistics on Indian Economy")

raw_list <- page %>%        # takes the page above for which we've read the html
  html_nodes("a") %>%       # find all links in the page
  html_attr("href") %>%     # get the url for these links
  str_subset("\\.PDF") %>%  # keep only links ending in .PDF (the site uses upper case)
  walk2(., basename(.), download.file, mode = "wb") # download each url to a file named after it
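
If you want the files collected in a dedicated folder, as the question asks, here is a minimal sketch of the same idea (the object name pdf_urls and the folder name rbi_pdfs are just placeholders, and it assumes the hrefs on the page are full URLs, as above):

pdf_urls <- page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.PDF")                                       # the same PDF links as above

dir.create("rbi_pdfs", showWarnings = FALSE)                 # create the target folder if needed
walk2(pdf_urls, file.path("rbi_pdfs", basename(pdf_urls)),   # pair each url with a path inside the folder
      download.file, mode = "wb")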

CodePudding user response:

When trying to run your code, I ran into "Verify that you are a human" and "Please ensure that your browser has Javascript enabled" dialogues. This suggests that you cannot open the page with rvest alone and need to use RSelenium browser automation instead.

Here is a modified version using RSelenium

library(tidyverse)
library(stringr)
library(purrr)
library(rvest)

library(RSelenium)

rD <- rsDriver(browser="firefox", port=4545L, verbose=F) # start a Selenium server and a Firefox client
remDr <- rD[["client"]]

remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook of Statistics on Indian Economy")
page <- remDr$getPageSource()[[1]] # grab the rendered page source as a string
read_html(page) -> html            # parse it with rvest from here on

html %>%
  html_nodes("a") %>%           # find all links on the page
  html_attr("href") %>%         # get their urls
  str_subset("\\.PDF") -> urls  # keep only the PDF links
urls %>% str_split(., '/') %>% unlist() %>% str_subset("\\.PDF") -> filenames # file name = last path component

for(u in 1:length(urls)) {
  cat(paste('downloading: ', u, ' of ', length(urls), '\n')) # progress message
  download.file(urls[u], filenames[u], mode='wb')
  Sys.sleep(1)                                               # pause briefly between downloads to go easy on the server
}
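
Once the downloads finish, it's good practice to shut down the Selenium session; something like this should work with the rD object created above:

remDr$close()    # close the browser window
rD$server$stop() # stop the Selenium server process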