I am scraping a website and want to save a large number (1000) of PDFs using R.
Below is a subset of my data:
head_data <- structure(list(url_pdf = c("https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf",
"https://projekter.aau.dk/projekter/files/415526224/FARDIG_SPECIALE_2_0__PDF.pdf",
"https://projekter.aau.dk/projekter/files/437213254/Den_Almene_Udfordring.pdf",
"https://projekter.aau.dk/projekter/files/415460040/Speciale_sociologi_NannaDyhr_MaleneScholer.pdf",
"https://projekter.aau.dk/projekter/files/420992851/Katrine_B._Dethlefsen__speciale_F2021.pdf",
"https://projekter.aau.dk/projekter/files/407804447/Speciale_2021_Katrine_May_Duus.pdf"
), id = c("413008245", "413011720", "432811291", "413009078",
"413050128", "405494084")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
which produces
url_pdf id
<chr> <chr>
1 https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf 413008245
2 https://projekter.aau.dk/projekter/files/415526224/FARDIG_SPECIALE_2_0__PDF.pdf 413011720
3 https://projekter.aau.dk/projekter/files/437213254/Den_Almene_Udfordring.pdf 432811291
4 https://projekter.aau.dk/projekter/files/415460040/Speciale_sociologi_NannaDyhr_MaleneScholer.pdf 413009078
5 https://projekter.aau.dk/projekter/files/420992851/Katrine_B._Dethlefsen__speciale_F2021.pdf 413050128
6 https://projekter.aau.dk/projekter/files/407804447/Speciale_2021_Katrine_May_Duus.pdf 405494084
Based on this, I wish to download each PDF from the url_pdf column into a subfolder pdfs/ and name each file after the corresponding value in the id column.
I can do this for each element using the following code:
for (url in head_data$url_pdf){ download.file(url, destfile = basename(url), mode = "wb") }
However, this names each PDF after its original file name, whereas I want each file to be named after its id. It also saves the files to the working directory rather than to the subfolder.
Any help is greatly appreciated!
CodePudding user response:
# create the subfolder first; download.file() does not create missing directories
dir.create("pdfs", showWarnings = FALSE)
for (i in seq_len(nrow(head_data))) {
  # retrieve url from head_data
  url <- head_data$url_pdf[i]
  # create filename, including .pdf and subfolder (pdfs/)
  filename <- paste0("pdfs/", head_data$id[i], ".pdf")
  # check if file already exists; if not, download and save
  if (!file.exists(filename)) {
    download.file(url, destfile = filename, mode = "wb")
  }
}
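With around 1000 files, a single broken link would stop the loop above with an error. The following is a minimal sketch of one way to guard against that, not part of the original answer: download_pdfs is a hypothetical helper name, and its download argument exists only so the downloader can be swapped out (it defaults to utils::download.file). It assumes that a failed download either raises an error or returns a non-zero status.

```r
# Hypothetical helper (an illustration, not from the original answer):
# download each URL to <dir>/<id>.pdf, skip files that already exist,
# and record which downloads succeeded instead of stopping at the first error.
download_pdfs <- function(urls, ids, dir = "pdfs",
                          download = utils::download.file) {
  stopifnot(length(urls) == length(ids))
  # make sure the target folder exists
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dir, paste0(ids, ".pdf"))
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    if (file.exists(dest[i])) {
      ok[i] <- TRUE
      next
    }
    ok[i] <- tryCatch({
      status <- download(urls[i], destfile = dest[i], mode = "wb", quiet = TRUE)
      # download.file() returns 0 on success
      isTRUE(status == 0)
    }, error = function(e) {
      # drop any partial file so that a rerun retries this URL
      unlink(dest[i])
      FALSE
    })
  }
  data.frame(id = ids, dest = dest, ok = ok, stringsAsFactors = FALSE)
}
```

Calling result <- download_pdfs(head_data$url_pdf, head_data$id) would then return one row per file, and result[!result$ok, ] lists the downloads that need another attempt.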