I'm scraping an ASP.NET website.
The following call returns a raw vector (reporte_nacido), which is a CSV file (with tab as the delimiter):
reporte_nacido <- postForm('https://xxxxx/WebSiteNDE/BirthsPages/FiltrosExcelNac.aspx',
                           .params = params,
                           curl = curl,
                           .opts = RCurl::curlOptions(ssl.verifypeer = FALSE, verbose = TRUE))
If I open the file in a text viewer, it looks like this:
Now I'm trying to load that raw vector in R, but I get the following error. I suspect the file downloaded from the server is somehow corrupted and R is being picky about it:
rawToChar(as.vector(unlist(reporte_nacido)))
Error in rawToChar(as.vector(unlist(reporte_nacido))) :
embedded nul in string: '\xfe\xff\0N\0\xda\0M\0E\0R\0O\0 \0C\0E\0R\0T\0I\0F\0I\0C\0A\0D\0O\0\t\0D\0E\0P\0A\0R\0T\0A\0M\0E\0N\0T\0O\0\t\0M\0U\0N\0I\0C\0I\0P\0I\0O\0\t\0A\0R\0E\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0I\0N\0S\0P\0E\0C\0C\0I\0O\0N\0 \0C\0O\0R\0R\0E\0G\0I\0M\0I\0E\0N\0T\0O\0 \0O\0 \0C\0A\0S\0E\0R\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0S\0I\0T\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0C\0\xd3\0D\0I\0G\0O\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0N\0O\0M\0B\0R\0E\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0S\0E\0X\0O\0\t\0P\0E\0S\0O\0 \0(\0G\0r\0a\0m\0o\0s\0)\0\t\0T\0A\0L\0L\0A\0 \0(\0C\0e\0n\0t\0\xed\0m\0e\0t\0r\0o\0s\0)\0\t\0F\0E\0C\0H\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0H\0O\0R\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0P\0A\0R\0T\0O\0 \0A\0T\0E\0N\0D\0I\0D\0O\0 \0P\0O\0R\0\t\0T\0I\0E\0M\0P\0O\0 \0D\0E\0 \0G\0E\0S\0T\0A\0C\0I\0\xd3\0N\0\t\0N\0\xda\0M\0E\0R\0O\0 \0C\0O\0N\0S\0U\0L\0T\0A\0S\0 \0P\0R\0E\0N\0A\0T\0A\0L\0E\0S\0\t\0T\0I\0P\0O\0 \0P\0A
CodePudding user response:
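The file is not corrupted. The raw vector you are getting is text encoded as UTF-16: the leading \xfe\xff in the error message is a big-endian byte order mark, and the \0 before each letter is the high byte of a two-byte character. rawToChar() cannot handle those nul bytes. You can convert it like this: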
library(stringi)
raw_vec <- as.vector(unlist(reporte_nacido))
decoded <- stri_encode(raw_vec, "UTF16")
decoded
#> [1] "NÚMERO CERTIFICADO\tDEPARTAMENTO\tMUNICIPIO\tAREA NACIMIENTO\tINSPECCION CORREGIMIENTO O CASERIO NACIMIENTO\tSITIO NACIMIENTO\tCÓDIGO INSTITUCIÓN\tNOMBRE INSTITUCIÓN\tSEXO\tPESO (Gramos)\tTALLA (Centímetros)\tFECHA NACIMIENTO\tHORA NACIMIENTO\tPARTO ATENDIDO POR\tTIEMPO DE GESTACIÓN\tNÚMERO CONSULTAS PRENATALES\tTIPO PA"
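If you would rather not depend on stringi, the same decoding can probably be done in base R with iconv(), which accepts a list of raw vectors. The sketch below is only an illustration: decode_utf16 is a hypothetical helper name, and the sample bytes are made up to mimic the start of your file (the real reporte_nacido is not available here). It assumes the data starts with a byte order mark, as yours does.

```r
# Hypothetical helper (not part of any package): decode a UTF-16 raw vector
# that starts with a byte order mark (BOM), using only base R.
decode_utf16 <- function(raw_vec) {
  stopifnot(is.raw(raw_vec), length(raw_vec) >= 2)
  bom <- raw_vec[1:2]
  if (identical(bom, as.raw(c(0xfe, 0xff)))) {
    enc <- "UTF-16BE"   # FE FF: big-endian, as in the error message above
  } else if (identical(bom, as.raw(c(0xff, 0xfe)))) {
    enc <- "UTF-16LE"   # FF FE: little-endian
  } else {
    stop("no UTF-16 byte order mark found; pass the encoding explicitly")
  }
  # Drop the two BOM bytes, then let iconv() convert the rest to UTF-8.
  iconv(list(raw_vec[-(1:2)]), from = enc, to = "UTF-8")
}

# Made-up sample standing in for as.vector(unlist(reporte_nacido)).
sample_text <- enc2utf8("N\u00daMERO CERTIFICADO\tDEPARTAMENTO\n1001\tANTIOQUIA\n")
sample_raw <- c(
  as.raw(c(0xfe, 0xff)),
  iconv(sample_text, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]
)

decoded_sample <- decode_utf16(sample_raw)
df_sample <- read.delim(text = decoded_sample, check.names = FALSE)
```

With your data, the call would be decode_utf16(as.vector(unlist(reporte_nacido))). Checking the BOM explicitly avoids relying on how a particular iconv implementation treats the generic "UTF-16" name.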
It appears to be tab-separated rather than csv format, so you probably want to read it like this:
read.table(text = decoded, sep = "\t", header = TRUE)
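One thing to watch for: several of your headers contain spaces and parentheses (for example "PESO (Gramos)"), and by default read.table() rewrites such names into syntactic ones like PESO..Gramos. A column holding only "F" values can also be read as logical FALSE. The following is a minimal sketch of settings that may help. The single data row is made up for illustration, and whether you want every column as character is an assumption about your use case.

```r
# Made-up stand-in for the decoded string; only the header names come
# from the output shown above, the data row is invented.
decoded_demo <- "N\u00daMERO CERTIFICADO\tPESO (Gramos)\tSEXO\n1001\t3200\tF\n"

df_demo <- read.delim(
  text = decoded_demo,
  check.names = FALSE,        # keep headers such as "PESO (Gramos)" unchanged
  quote = "",                 # do not treat quote characters in fields specially
  colClasses = "character"    # read every column as text, so "F" stays "F"
)

# Convert individual columns afterwards, once you know their types.
df_demo[["PESO (Gramos)"]] <- as.numeric(df_demo[["PESO (Gramos)"]])
```

Columns with non-syntactic names then have to be accessed with [[ ]] or backticks, as in the last line.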