How to conditionally name PDFs saved in a for loop in R

Time:12-08

I am scraping a website and want to save a large number of PDFs (1000) through R.

Below is a subset of my data:

head_data <- structure(list(url_pdf = c("https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf", 
"https://projekter.aau.dk/projekter/files/415526224/FARDIG_SPECIALE_2_0__PDF.pdf", 
"https://projekter.aau.dk/projekter/files/437213254/Den_Almene_Udfordring.pdf", 
"https://projekter.aau.dk/projekter/files/415460040/Speciale_sociologi_NannaDyhr_MaleneScholer.pdf", 
"https://projekter.aau.dk/projekter/files/420992851/Katrine_B._Dethlefsen__speciale_F2021.pdf", 
"https://projekter.aau.dk/projekter/files/407804447/Speciale_2021_Katrine_May_Duus.pdf"
), id = c("413008245", "413011720", "432811291", "413009078", 
"413050128", "405494084")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

which produces

 url_pdf                                                                                           id       
  <chr>                                                                                             <chr>    
1 https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf                              413008245
2 https://projekter.aau.dk/projekter/files/415526224/FARDIG_SPECIALE_2_0__PDF.pdf                   413011720
3 https://projekter.aau.dk/projekter/files/437213254/Den_Almene_Udfordring.pdf                      432811291
4 https://projekter.aau.dk/projekter/files/415460040/Speciale_sociologi_NannaDyhr_MaleneScholer.pdf 413009078
5 https://projekter.aau.dk/projekter/files/420992851/Katrine_B._Dethlefsen__speciale_F2021.pdf      413050128
6 https://projekter.aau.dk/projekter/files/407804447/Speciale_2021_Katrine_May_Duus.pdf             405494084

Based on this, I want to download each PDF listed in the url_pdf column to a subfolder, pdfs/, and name each file after the corresponding value in the id column.

I can do this for each element using the following code:

for (url in head_data$url_pdf) {
  download.file(url, destfile = basename(url), mode = "wb")
}

However, this names each PDF after the file name in its URL, and I want it to be named after id. It also saves the files to the working directory rather than to the subfolder.

Any help is greatly appreciated!

CodePudding user response:

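# create the subfolder if it does not exist yet (download.file will not create it)
dir.create("pdfs", showWarnings = FALSE)

for (i in seq_len(nrow(head_data))) {
  # retrieve the URL from head_data
  url <- head_data$url_pdf[i]

  # build the file name, including the subfolder (pdfs/) and the .pdf extension
  filename <- file.path("pdfs", paste0(head_data$id[i], ".pdf"))

  # download and save the file only if it does not already exist
  if (!file.exists(filename)) {
    download.file(url, destfile = filename, mode = "wb")
  }
}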
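
With around 1000 files, a single failed download would stop a plain loop partway through. The following is a minimal sketch of one way to guard against that, not part of the original answer: build_destfiles() and download_all() are hypothetical helper names, and the downloader argument is an assumed hook (defaulting to download.file) that makes the function easy to try without network access.

```r
# Build destination paths of the form "<dir>/<id>.pdf" from a vector of ids.
build_destfiles <- function(ids, dir = "pdfs") {
  file.path(dir, paste0(ids, ".pdf"))
}

# Download each URL to its destination, skipping files that already exist.
# Returns a logical vector: TRUE where the file is present afterwards.
# `downloader` is a hypothetical hook; by default it is utils::download.file.
download_all <- function(urls, destfiles, downloader = utils::download.file) {
  stopifnot(length(urls) == length(destfiles))

  # Make sure every target folder exists before downloading.
  for (d in unique(dirname(destfiles))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    if (file.exists(destfiles[i])) {
      ok[i] <- TRUE
      next
    }
    # Catch errors so one bad URL does not abort the remaining downloads.
    ok[i] <- tryCatch({
      status <- downloader(urls[i], destfile = destfiles[i], mode = "wb")
      isTRUE(status == 0)  # download.file returns 0 on success
    }, error = function(e) {
      message("Failed: ", urls[i], " (", conditionMessage(e), ")")
      FALSE
    })
  }
  ok
}

# Usage with the data from the question (left commented out, as it needs network access):
# ok <- download_all(head_data$url_pdf, build_destfiles(head_data$id))
# head_data$id[!ok]  # ids that still need a retry
```

Because existing files are skipped, re-running the same call should only retry the downloads that failed earlier.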