I have this user-defined function that uses the rvest package to get downloadable files from a web page.
GetFluDataFiles <- function(URL = "https://www1.health.gov.au/internet/main/publishing.nsf/Content/ohp-pub-datasets.htm",
                            REMOVE_URL_STRING = "ohp-pub-datasets.htm/",
                            DEBUG = TRUE){
  # Requires: rvest, stringr, magrittr (`%>%`), and data.table (`%like%`)
  if(DEBUG) message("GetFluDataFiles: Function initialized")
  FUNCTION_OUTPUT <- list()
  FUNCTION_OUTPUT[["URL"]] <- URL

  page <- rvest::read_html(URL)

  if(DEBUG) message("GetFluDataFiles: Get all downloadable files on webpage")
  all_downloadable_files <- page %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href") %>%
    stringr::str_subset("\\.xlsx")
  FUNCTION_OUTPUT[["ALL_DOWNLOADABLE_FILES"]] <- all_downloadable_files

  if(DEBUG) message("GetFluDataFiles: Get all downloadable files on webpage which contain flu data")
  influenza_file <- all_downloadable_files[tolower(all_downloadable_files) %like% "influenza"]
  FUNCTION_OUTPUT[["FLU_FILE"]] <- influenza_file

  file_path <- file.path(URL, influenza_file)
  FUNCTION_OUTPUT[["FLU_FILE_PATH"]] <- file_path

  if(DEBUG) message("GetFluDataFiles: Collect final path")
  if(!is.null(REMOVE_URL_STRING)){
    # fixed = TRUE: treat REMOVE_URL_STRING literally (it contains regex metacharacters like ".")
    full_final_path <- gsub(REMOVE_URL_STRING, "", file_path, fixed = TRUE)
  } else {
    full_final_path <- file_path
  }
  FUNCTION_OUTPUT[["FULL_FINAL_PATH"]] <- full_final_path

  # Note: the original test `!is.na(x) | !is.null(x)` was always TRUE,
  # because `!is.null(x)` holds for any non-NULL value.
  if(length(full_final_path) == 1 && !is.na(full_final_path)){
    if(DEBUG) message("GetFluDataFiles: Function run completed")
    return(FUNCTION_OUTPUT)
  } else {
    stop("GetFluDataFiles: No flu data file path could be built")
  }
}
I've used this function to extract the data that I want, and everything seems to work: I am able to download the file.
> output <- GetFluDataFiles()
GetFluDataFiles: Function initialized
GetFluDataFiles: Get all downloadable files on webpage
GetFluDataFiles: Get all downloadable files on webpage which contain flu data
GetFluDataFiles: Collect final path
GetFluDataFiles: Function run completed
> output$FULL_FINAL_PATH
[1] "https://www1.health.gov.au/internet/main/publishing.nsf/Content/C4DDC0B448F04792CA258728001EC5D0/$File/x.Influenza-laboratory-confirmed-Public-datset-2008-2019.xlsx"
> download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx")
trying URL 'https://www1.health.gov.au/internet/main/publishing.nsf/Content/C4DDC0B448F04792CA258728001EC5D0/$File/x.Influenza-laboratory-confirmed-Public-datset-2008-2019.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 27134133 bytes (25.9 MB)
downloaded 25.9 MB
And the file exists.
> file.exists("myfile.xlsx")
[1] TRUE
But when I go to import the xlsx file, this error pops up.
> library("readxl")
> my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)
Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo
What is this error? How can I resolve it?
CodePudding user response:
Set the download method to curl. Error -103 comes from the unzip code that readxl uses internally: an .xlsx file is really a zip archive, and the copy written by the default download method was not a valid one, i.e. the download was corrupted (typically by a non-binary transfer). Downloading with curl (or passing mode = "wb") writes the bytes through untouched:
download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx", method = 'curl')
my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)
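If a download fails the same way again, a quick check of the zip signature tells you whether the file itself is corrupt before blaming readxl. This helper is a sketch, not part of the original answer (`is_valid_xlsx` is a made-up name); it relies only on the fact that every zip archive, and therefore every valid .xlsx, begins with the bytes "PK":

```r
# A valid .xlsx is a zip archive, and every zip file starts with the
# two bytes "PK" (0x50 0x4B). If this check fails, the download is corrupt.
is_valid_xlsx <- function(path) {
  magic <- readBin(path, what = "raw", n = 2L)
  identical(rawToChar(magic), "PK")
}

# After a binary-mode download, is_valid_xlsx("myfile.xlsx") should be TRUE.
```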