Text files extracted from zipped files, accessed via urls, are being downloaded to the working directory


I am trying to retrieve multiple data txt files that match a certain pattern from multiple zipped files that I access through urls. I wrote a function that downloads the desired files from a url, reads them into a list of dataframes, and rbinds the dataframes together. I then sapply this function over a list of urls.

My desired end result is to have all the downloaded data from all urls in a single dataframe in the global environment in R.

Currently, however, the individual files get extracted into my working directory, which I don't want, and are not combined into a single dataframe. I'm wondering whether the problem stems from download.file, but I have been unable to find a solution or posts with similar issues.

# list of urls
url_df = data.frame(model = c("rcp26", "rcp45", "rcp85"),
                    url = c("https://b2share.eudat.eu/api/files/d4850267-3ce2-44f4-b5e3-8391a4f3dc27/LTER_site_data_from_EURO-CORDEX-RCMs_rel1.see_disclaimer.77c127c4-2ebe-453b-b5af-61858ff02e31.huss_historical_rcp26_day_txt.zip",
                            "https://b2share.eudat.eu/api/files/d4850267-3ce2-44f4-b5e3-8391a4f3dc27/LTER_site_data_from_EURO-CORDEX-RCMs_rel1.see_disclaimer.77c127c4-2ebe-453b-b5af-61858ff02e31.huss_historical_rcp45_day_txt.zip",
                            "https://b2share.eudat.eu/api/files/d4850267-3ce2-44f4-b5e3-8391a4f3dc27/LTER_site_data_from_EURO-CORDEX-RCMs_rel1.see_disclaimer.77c127c4-2ebe-453b-b5af-61858ff02e31.huss_historical_rcp85_day_txt.zip"))

# create empty dataframe where data will be saved
downloaded_data = data.frame()

# create function to retrieve desired files from a single url
get_data = function(url) {
  temp <- tempfile() # create temp file
  download.file(url,temp) # download file contained in the url
  
  # get a list of the desired files
  file.list <- grep("KNMI-RACMO22E.*txt|MPI-CSC-REMO.*txt|SMHI-RCA4.*txt", unzip(temp, list=TRUE)$Name, ignore.case=TRUE, value=TRUE)
  
  data.list = lapply(unzip(temp, files=file.list), read.table, header=FALSE,  comment.char = "", check.names = FALSE)
  
  # bind the dataframes in the list into one single dataframe
  bound_data = dplyr::bind_rows(data.list)
  
  downloaded_data = rbind(downloaded_data, bound_data )
  
  return(downloaded_data)
  
  unlink(temp)
}

# apply function over the list of urls
sapply(url_df$url, get_data)

Any help would be greatly appreciated!

CodePudding user response:

Assigning to downloaded_data inside the function only creates a local copy, so the global dataframe is never updated -- instead, have the function return the combined data for one URL, apply it to each URL separately, and then bind the results together to create downloaded_data. The unzipping and reading also needed some changes to make sure the files were actually being read in: passing exdir to unzip() extracts the files into the temp directory rather than your working directory, and the extracted paths are then read in directly.
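
As a quick illustration of the scoping rule (toy data, not the climate files): assignment inside a function creates a new local binding, so the global object is left untouched.

df <- data.frame(x = 1)
f <- function() {
  df <- rbind(df, data.frame(x = 2)) # binds a local df; the global df is unchanged
  return(df)
}
f()      # returns a 2-row dataframe
nrow(df) # still 1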

# list of urls
url_df = data.frame(model = c("rcp26", "rcp45", "rcp85"),  
                    url = c("https://b2share.eudat.eu/api/files/d4850267-3ce2-44f4-b5e3-8391a4f3dc27/LTER_site_data_from_EURO-CORDEX-RCMs_rel1.see_disclaimer.77c127c4-2ebe-453b-b5af-61858ff02e31.huss_historical_rcp26_day_txt.zip",
                            "https://b2share.eudat.eu/api/files/d4850267-3ce2-44f4-b5e3-8391a4f3dc27/LTER_site_data_from_EURO-CORDEX-RCMs_rel1.see_disclaimer.77c127c4-2ebe-453b-b5af-61858ff02e31.huss_historical_rcp45_day_txt.zip",
                            "https://b2share.eudat.eu/api/files/d4850267-3ce2-44f4-b5e3-8391a4f3dc27/LTER_site_data_from_EURO-CORDEX-RCMs_rel1.see_disclaimer.77c127c4-2ebe-453b-b5af-61858ff02e31.huss_historical_rcp85_day_txt.zip"))

# create function to retrieve desired files from a single url
get_data = function(url) {
  temp <- tempdir() # use the session temp directory
  zipfile <- file.path(temp, "downloaded.zip")
  download.file(url, zipfile) # download the zip archive from the url
  downloaded_files <- unzip(zipfile, exdir = temp) # extract into the temp dir, not the working directory
  keep_files <- downloaded_files[grep("KNMI-RACMO22E.*txt|MPI-CSC-REMO.*txt|SMHI-RCA4.*txt",
                                      downloaded_files)]
  data.list <- lapply(keep_files, read.table, header = FALSE, comment.char = "", check.names = FALSE)
  # bind the dataframes in the list into one single dataframe
  bound_data = dplyr::bind_rows(data.list)
  unlink(c(zipfile, downloaded_files)) # clean up; must run before return, not after
  return(bound_data)
}
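
A variant of the same function (a sketch; get_data_safe is just an illustrative name) registers the cleanup with on.exit() as soon as each file is created, so the temporary files are removed even if download.file() or read.table() fails partway:

get_data_safe = function(url) {
  temp <- tempdir()
  zipfile <- file.path(temp, "downloaded.zip")
  on.exit(unlink(zipfile), add = TRUE) # cleanup runs on any exit, including errors
  download.file(url, zipfile)
  downloaded_files <- unzip(zipfile, exdir = temp)
  on.exit(unlink(downloaded_files), add = TRUE) # also remove the extracted files
  keep_files <- grep("KNMI-RACMO22E.*txt|MPI-CSC-REMO.*txt|SMHI-RCA4.*txt",
                     downloaded_files, value = TRUE)
  dplyr::bind_rows(lapply(keep_files, read.table, header = FALSE,
                          comment.char = "", check.names = FALSE))
}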

# apply function over the list of urls
downloaded_data <- dplyr::bind_rows(lapply(url_df$url, get_data))
dim(downloaded_data)
#> [1] 912962      7
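
If you also want to record which scenario each row came from, the model column of url_df (unused above) can be attached to each url's rows before binding; a sketch under the same setup:

# tag each url's rows with its model label before combining
downloaded_data <- dplyr::bind_rows(
  Map(function(u, m) cbind(model = m, get_data(u)),
      as.character(url_df$url), url_df$model)
)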