Opening an xlsx file downloaded from a website


I have this user-defined function that uses the rvest package to find downloadable files on a web page.

# Requires the rvest and stringr packages, plus the magrittr pipe (%>%)
GetFluDataFiles <- function(URL = "https://www1.health.gov.au/internet/main/publishing.nsf/Content/ohp-pub-datasets.htm",
                            REMOVE_URL_STRING = "ohp-pub-datasets.htm/",    
                            DEBUG = TRUE){

    if(DEBUG) message("GetFluDataFiles: Function initialized  \n")

    FUNCTION_OUTPUT <- list()

    FUNCTION_OUTPUT[["URL"]] <- URL

    page <- rvest::read_html(URL)

    if(DEBUG) message("GetFluDataFiles: Get all downloadable files on webpage  \n")

    # Collect the href of every link on the page, then keep only the .xlsx ones
    all_downloadable_files <- page %>%
                                rvest::html_nodes("a") %>%
                                rvest::html_attr("href") %>%
                                stringr::str_subset("\\.xlsx")
    # all_downloadable_files

    FUNCTION_OUTPUT[["ALL_DOWNLOADABLE_FILES"]] <- all_downloadable_files

    if(DEBUG) message("GetFluDataFiles: Get all downloadable files on webpage which contain flu data  \n")

    # Keep the file(s) whose name mentions influenza (case-insensitive);
    # base grepl() avoids the data.table dependency of %like%
    influenza_file <- all_downloadable_files[grepl("influenza", tolower(all_downloadable_files))]
    # influenza_file   

    FUNCTION_OUTPUT[["FLU_FILE"]] <- influenza_file

    file_path <- file.path(URL, influenza_file)
    # file_path

    FUNCTION_OUTPUT[["FLU_FILE_PATH"]] <- file_path

    if(DEBUG) message("GetFluDataFiles: Collect final path  \n")

    if(!is.null(REMOVE_URL_STRING)){
        # fixed = TRUE treats REMOVE_URL_STRING as a literal string, not a regex
        full_final_path <- gsub(REMOVE_URL_STRING, "", file_path, fixed = TRUE)
    } else {    
        full_final_path <- file_path    
    }

    FUNCTION_OUTPUT[["FULL_FINAL_PATH"]] <- full_final_path

    # Note: the original check `!is.na(x) | !is.null(x)` was always TRUE;
    # this verifies that a non-NA path was actually constructed
    if(length(full_final_path) > 0 && !any(is.na(full_final_path))){
        if(DEBUG) message("GetFluDataFiles: Function run completed  \n")

        return(FUNCTION_OUTPUT)
    } else {
         stop("GetFluDataFiles: Folders not created  \n")    
    }

}

I've used this function to extract the path to the data that I want.

Everything seems to work: I am able to download the file.

> output <- GetFluDataFiles()

GetFluDataFiles: Function initialized 

GetFluDataFiles: Get all downloadable files on webpage 

GetFluDataFiles: Get all downloadable files on webpage which contain flu data 

GetFluDataFiles: Collect final path 

GetFluDataFiles: Function run completed 

> output$FULL_FINAL_PATH 
[1] "https://www1.health.gov.au/internet/main/publishing.nsf/Content/C4DDC0B448F04792CA258728001EC5D0/$File/x.Influenza-laboratory-confirmed-Public-datset-2008-2019.xlsx"

> download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx") 
trying URL 'https://www1.health.gov.au/internet/main/publishing.nsf/Content/C4DDC0B448F04792CA258728001EC5D0/$File/x.Influenza-laboratory-confirmed-Public-datset-2008-2019.xlsx'

Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 27134133 bytes (25.9 MB)

downloaded 25.9 MB

And the file exists.

> file.exists("myfile.xlsx")  
[1] TRUE
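
Of course, file.exists() only shows that something is on disk, not that it is intact. A quick size check against the 27134133 bytes the server reported would rule out a truncated transfer:

file.size("myfile.xlsx")  # should be 27134133 if the transfer completed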

But when I try to import the xlsx file, this error pops up.

> library("readxl")

> my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)

Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo

What is this error? How can I resolve it?

CodePudding user response:

Set the download method to curl. An .xlsx file is really a zip archive, and error -103 from unzGetCurrentFileInfo means readxl cannot open the downloaded file as a zip. In other words, the download was corrupted, most often because the default method transferred the file in text mode rather than binary.
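
You can confirm this before re-downloading: a valid .xlsx should list its internal entries with base R's unzip(), while a corrupted file fails here too.

# A valid .xlsx is a zip archive, so listing its contents should succeed;
# on a corrupted file this call errors as well
unzip("myfile.xlsx", list = TRUE)

Re-downloading with method = 'curl' fetches the file as binary and avoids the corruption: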

download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx", method = 'curl') 
my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)
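
If curl is not available on your system, forcing binary mode with the default method should work as well:

download.file(output$FULL_FINAL_PATH, destfile = "myfile.xlsx", mode = "wb")  # "wb" = write binary
my_data <- read_excel("myfile.xlsx", sheet = 1, skip = 1)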