As part of a Google DA certificate assignment, I was trying to find an elegant way to download, unzip, and merge multiple .csv files using R, but I keep hitting the same issue over and over again:
error 1 in extracting from zip file
Data source: Divvy
The code I run is:
## declare variable file names corresponding to calendar months
months <- c(202011:202012,202101:202110)
## declare directory for storing source files
storage <- "C:\\Users\\...\\start"
## vectors of all urls to download from and destination files
urls <-
paste0("https://divvy-tripdata.s3.amazonaws.com/",months, "-divvy-tripdata.zip")
## idea was to download archives into temporary files, unzip contents to 'storage' directory and remove tempdir.
temp <- tempdir()
tempfile <- paste0(temp,"\\",months,".zip")
##Downloading 12 months archives
for(i in seq(urls)){
download.file(urls[i],tempfile[i], mode="wb")
}
file_names <- list.files(temp, pattern = ".zip")
for (i in seq(file_names)){
unzip(file_names,exdir=storage,overwrite = FALSE)}
Warning in unzip("file_names", exdir = storage, overwrite = FALSE) : error 1 in extracting from zip file
Everything works until the unzip step. All archives are downloaded and can be opened, the files are not corrupt, and the file properties show the extension as .zip.
I've tried my code on multiple machines and in different directories, tried downloading the archives manually, and tried unzipping each archive individually and all at once using loops and ldply.
The result is always the same.
I've spent 3 days trying to solve it and would appreciate any help :)
Answer:
With the same months and urls variables, the following seems simpler. Note the different way of putting together the temp file names, with file.path.
tmpdir <- tempdir()
tmpfile <- file.path(tmpdir, months)
tmpfile <- paste0(tmpfile, ".zip")
## Download each monthly archive, then unzip it straight into 'storage'
for(i in seq(urls)){
download.file(urls[i], tmpfile[i], mode="wb")
unzip(tmpfile[i], exdir = storage, overwrite = FALSE)
}
unlink(tmpfile)
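## tempdir() is the per-session temporary directory; R removes it when the session ends, so it is left in place here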
list.files(storage, pattern = "\\.csv")
# [1] "202011-divvy-tripdata.csv" "202012-divvy-tripdata.csv"
# [3] "202101-divvy-tripdata.csv" "202102-divvy-tripdata.csv"
# [5] "202103-divvy-tripdata.csv" "202104-divvy-tripdata.csv"
# [7] "202105-divvy-tripdata.csv" "202106-divvy-tripdata.csv"
# [9] "202107-divvy-tripdata.csv" "202108-divvy-tripdata.csv"
#[11] "202109-divvy-tripdata.csv" "202110-divvy-tripdata.csv"
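
As for why the original loop failed, here is a likely explanation, offered as a guess rather than a confirmed diagnosis. By default list.files() returns bare file names without the directory, so unzip() looks for them in the working directory instead of tempdir(). As far as I can tell, "error 1" is what unzip() reports when it cannot open the file it was given. The sketch below uses a hypothetical zip_demo folder with empty stand-in files (not real archives) to show what full.names = TRUE changes.

```r
# Hypothetical demo folder with two empty stand-in ".zip" files
demo_dir <- file.path(tempdir(), "zip_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("202011.zip", "202012.zip")))

# Default: bare file names, which unzip() would resolve against getwd()
bare <- list.files(demo_dir, pattern = "\\.zip$")

# full.names = TRUE: paths that include the directory
full <- list.files(demo_dir, pattern = "\\.zip$", full.names = TRUE)

print(bare)
print(full)

# The bare names only exist if getwd() happens to be demo_dir;
# the full paths exist regardless of the working directory
print(file.exists(bare))
print(file.exists(full))

# With full paths, the original two-step approach would index one
# archive per call (storage is the directory from the question):
# for (i in seq_along(full)) {
#   unzip(full[i], exdir = storage, overwrite = FALSE)
# }
```

The loop is left commented out because the stand-in files are empty and would not unzip. With the real downloads, the same pattern (full.names = TRUE plus file_names[i] inside the loop) should behave like the answer's code.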
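
The question also mentions merging the files, which the code above does not cover. Here is a minimal base R sketch, assuming every .csv has the same column names. The merge_demo folder and its two tiny files are hypothetical stand-ins so the example runs without the real downloads; with the real data you would point list.files() at storage instead.

```r
# Hypothetical stand-in data: two small CSV files with identical columns
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a", "b"), month = "202011"),
          file.path(demo_dir, "202011-demo.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "c", month = "202012"),
          file.path(demo_dir, "202012-demo.csv"), row.names = FALSE)

# full.names = TRUE so read.csv() receives complete paths
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a data frame, then stack the rows into one
trips <- do.call(rbind, lapply(csv_files, read.csv))

print(nrow(trips))
print(names(trips))
```

rbind() matches columns by name, so this only works cleanly when the monthly files share a layout. If the column types differ between months, it may be safer to set colClasses in read.csv() before binding.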