Encoding problems when downloading files scraped from a webpage

Time:09-05

I have scraped quite a few URLs to download files from the internet. Most of them work fine, for example:

url1 <- "http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02006-ALCADOZO/A.ES.SDGC.CP.02006.zip"
download.file(url1, destfile = "A.ES.SDGC.CP.02006.zip", quiet = TRUE)

Works fine, but

url2 <- "http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip"
download.file(url2, destfile = "A.ES.SDGC.CP.02007.zip", quiet = TRUE)

fails with:

Error in download.file(municipio, destfile = filename, quiet = TRUE) : 
  cannot open URL 'http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip'
In addition: Warning message:
In download.file(municipio, destfile = filename, quiet = TRUE) :
  URL 'http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip': status was 'URL using bad/illegal format or missing URL'

I know the problem is caused by the white space and the encoding (the same happens with other characters, such as Ñ), but I have been unable to solve it by forcing a Windows encoding ("windows-1252") on the URL.

curl::curl_download doesn't solve the problem either.

Curiously, if I copy and paste the URL into the browser, everything works fine and I can download the file.

Any help would be appreciated.

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8    LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Spanish_Spain.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_1.0.3     forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10    purrr_0.3.4     readr_2.1.2     tidyr_1.2.0    
 [8] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2

loaded via a namespace (and not attached):
 [1] pillar_1.8.1        compiler_4.2.1      cellranger_1.1.0    dbplyr_2.2.1        tools_4.2.1         lubridate_1.8.0    
 [7] jsonlite_1.8.0      googledrive_2.0.0   lifecycle_1.0.1     gargle_1.2.0        gtable_0.3.1        pkgconfig_2.0.3    
[13] rlang_1.0.5         reprex_2.0.2        DBI_1.1.3           cli_3.3.0           rstudioapi_0.14     curl_4.3.2         
[19] haven_2.5.1         xml2_1.3.3          withr_2.5.0         httr_1.4.4          hms_1.1.2           generics_0.1.3     
[25] vctrs_0.4.1         fs_1.5.2            tictoc_1.0.1        googlesheets4_1.0.1 grid_4.2.1          tidyselect_1.1.2   
[31] glue_1.6.2          R6_2.5.1            fansi_1.0.3         readxl_1.4.1        selectr_0.4-2       tzdb_0.3.0         
[37] modelr_0.1.9        magrittr_2.0.3      ellipsis_0.3.2      backports_1.4.1     scales_1.2.1        assertthat_0.2.1   
[43] colorspace_2.0-3    utf8_1.2.2          stringi_1.7.8       munsell_0.5.0       broom_1.0.1         crayon_1.5.1    

Windows encoding:

[System.Text.Encoding]::Default

IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Europeo occidental (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252

CodePudding user response:

Your url2 string contains spaces, which must be percent-encoded (see the Details section of ?download.file):

url2 <- "http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip"
download.file(URLencode(url2), destfile = "A.ES.SDGC.CP.02007.zip", quiet = TRUE)
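
To see what URLencode() actually changes before touching the network, it can help to print the encoded string. The following is a minimal sketch. The last URL is made up for illustration, and the idea that the server expects UTF-8 percent-encoding for characters such as Ñ is an assumption; the thread only confirms the case with spaces.

```r
# URL from the question; the spaces in the directory name are the problem.
url2 <- "http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip"

# With the default reserved = FALSE, URLencode() leaves ":" and "/" alone
# and percent-encodes only characters that are not allowed in a URL.
encoded <- URLencode(url2)
print(encoded)

# With the default repeated = FALSE, a string that already contains
# percent-escapes is returned unchanged, so encoding twice is harmless.
twice <- URLencode(encoded)
print(identical(encoded, twice))

# Hypothetical URL with a non-ASCII character (not a real address).
# Assumption: the server expects UTF-8 bytes, so convert the string first.
url_n <- enc2utf8("http://example.com/PE\u00d1A/file.zip")
print(URLencode(url_n))
```

If the non-ASCII case still fails after this, the server may expect a different byte encoding, which is worth checking against the URL the browser actually requests.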

CodePudding user response:

Given the binary format of .zip files, consider the mode="wb" argument of download.file:

url2 <- paste0(
   "http://www.catastro.minhap.es/",
   "INSPIRE/CadastralParcels/02/",
   "02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip"
)

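# The spaces in the URL still have to be percent-encoded,
# otherwise the request fails before mode matters.
download.file(
    URLencode(url2),
    destfile = "A.ES.SDGC.CP.02007.zip",
    mode = "wb",
    quiet = TRUE
)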
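
Since the question involves many scraped URLs, the two suggestions above can be combined in a small wrapper. This is only a sketch: encode_urls() and download_all() are hypothetical helper names, not part of any package, and naming each file after the last path component of its URL is an assumption about the desired layout.

```r
# Percent-encode each URL. vapply() applies URLencode() to one string
# at a time and guarantees a character vector of the same length.
encode_urls <- function(urls) {
  vapply(urls, URLencode, character(1), USE.NAMES = FALSE)
}

# Download every URL into dest_dir and return TRUE/FALSE per URL
# instead of stopping at the first failure.
download_all <- function(urls, dest_dir = ".") {
  # Assumption: the last path component is a usable file name,
  # e.g. "A.ES.SDGC.CP.02007.zip".
  dest <- file.path(dest_dir, basename(urls))
  encoded <- encode_urls(urls)
  vapply(seq_along(urls), function(i) {
    tryCatch({
      # mode = "wb" writes the zip as binary, which matters on Windows.
      status <- download.file(encoded[i], destfile = dest[i],
                              mode = "wb", quiet = TRUE)
      status == 0  # download.file() returns 0 on success
    }, error = function(e) {
      message("Failed: ", urls[i], " (", conditionMessage(e), ")")
      FALSE
    })
  }, logical(1))
}

# The two URLs from the question; only the encoding step is run here.
urls <- c(
  "http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02006-ALCADOZO/A.ES.SDGC.CP.02006.zip",
  "http://www.catastro.minhap.es/INSPIRE/CadastralParcels/02/02007-ALCALA DEL JUCAR/A.ES.SDGC.CP.02007.zip"
)
print(encode_urls(urls))

# Uncomment to perform the downloads (needs network access):
# ok <- download_all(urls)
# urls[!ok]  # URLs that could not be downloaded
```

Returning a logical vector rather than stopping on the first error makes it easy to retry or inspect only the URLs that failed.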