Downloading and storing multiple files from URLs on R; skipping urls that are empty


Thanks in advance for any feedback.

As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:

- Each document I want to scrape has a document number, but the numbers don't always go up in order. For example, one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. Indexing the path for where to store the downloaded data ends up with the first document stored in the named place and the next documents as NA.

Here's the code I've been working on:

base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"

#document.number <- 2321
document.numbers <- c(2330:2333)

for (i in 1:length(document.numbers)) {

  temp.doc.name <- paste0(base.url,
                          document.name.1,
                          document.numbers[i],
                          document.extension)
  print(temp.doc.name)

  #download and save data
  safely <- purrr::safely(download.file(temp.doc.name,
                                        destfile = "/Users/...[i]"))

}

Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code separately for each of the 15 years I'm interested in, in order to (hopefully) keep it manageable.
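To be concrete, this is roughly the per-year setup I have in mind (the base path and folder names here are just placeholders, not my real paths):

year <- 2022
dest.dir <- file.path("~", "dissertation-data", paste0("written-questions-", year))
dir.create(dest.dir, recursive = TRUE, showWarnings = FALSE)

#inside the download loop, the destination would then be something like:
#destfile = file.path(dest.dir, paste0(document.numbers[i], ".docx"))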

Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the 25 most recent documents. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.

Again, thanks for any feedback!

Kari

CodePudding user response:

If the file does not exist, tryCatch simply skips it:

library(tidyverse)

get_data <- function(index) {
  url <- paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  )
  #if download.file() errors (e.g. the document number does not exist),
  #tryCatch() catches the error and the loop moves on to the next index
  tryCatch(
    download.file(url,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE),
    error = function(e) print(paste(index, "does not exist - SKIPPED"))
  )
}

map(2000:5000, get_data)
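Since the downloads are run purely for their side effects, purrr::walk() can be used in place of map() so a list of return values isn't collected:

walk(2000:5000, get_data)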

CodePudding user response:

After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL and only download when it returns a "good" status (usually 200), skipping anything that returns a "bad" status (like 404). The code block below does that.

Note that purrr::safely doesn't run a function -- it creates another, "safe" function, which you then call. The created function always returns a list with two slots: result and error.
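As a quick illustration (using log() just as a stand-in, since any function can be wrapped this way):

safe_log <- purrr::safely(log)

safe_log(10)
#$result is 2.302585, $error is NULL

safe_log("a")
#$result is NULL, $error holds the "non-numeric argument" error instead of stopping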

base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"

#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)

# safely() wraps each function; the wrapped versions return a list with $result and $error
sHEAD <- purrr::safely(httr::HEAD)
sdownload <- purrr::safely(download.file)

for (i in seq_along(document.numbers)) {

    file_name <- paste0(document.name.1, document.numbers[i], document.extension)
    temp.doc.name <- paste0(base.url, file_name)
    print(temp.doc.name)

    # poll the URL once and reuse the response
    head_result <- sHEAD(temp.doc.name)$result
    print(head_result$status_code)

    # only download when the HEAD request succeeded and returned a 2xx status
    if (!is.null(head_result) && head_result$status_code %in% 200:299) {

        sdownload(temp.doc.name, destfile = file_name)
    }

}

It might not be as simple as all of the valid URLs returning a '200' status; in general, anything in the 200:299 range is OK (I've edited the answer to reflect this).
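As a sketch of an alternative check, httr also has a helper for this: httr::http_error() returns TRUE for any 4xx/5xx response, so the condition inside the loop could also be written as below. Note this is slightly looser than the 200:299 check, since it would also treat 3xx redirects as non-errors.

head_result <- sHEAD(temp.doc.name)$result

if (!is.null(head_result) && !httr::http_error(head_result)) {
    sdownload(temp.doc.name, destfile = file_name)
}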

I used parts of this answer in my answer.
