How to get rid of embedded NUL on a raw vector?

Time:11-05

I'm scraping an ASP.NET website.

The following call returns a raw vector (reporte_nacido) that holds a CSV file (with tab as the delimiter):

# params and curl are set up earlier in the scraping session
reporte_nacido <- postForm('https://xxxxx/WebSiteNDE/BirthsPages/FiltrosExcelNac.aspx',
                           .params = params,
                           curl = curl,
                           .opts = RCurl::curlOptions(ssl.verifypeer = FALSE, verbose = TRUE))

If I open the file in a text viewer, it looks like this:

[screenshot not preserved]

Now I'm trying to load that raw element in R, but I get the following error. I suspect the file downloaded from the server is somehow corrupted and R is being picky about it:

rawToChar(as.vector(unlist(reporte_nacido)))

Error in rawToChar(as.vector(unlist(reporte_nacido))) : 
  embedded nul in string: '\xfe\xff\0N\0\xda\0M\0E\0R\0O\0 \0C\0E\0R\0T\0I\0F\0I\0C\0A\0D\0O\0\t\0D\0E\0P\0A\0R\0T\0A\0M\0E\0N\0T\0O\0\t\0M\0U\0N\0I\0C\0I\0P\0I\0O\0\t\0A\0R\0E\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0I\0N\0S\0P\0E\0C\0C\0I\0O\0N\0 \0C\0O\0R\0R\0E\0G\0I\0M\0I\0E\0N\0T\0O\0 \0O\0 \0C\0A\0S\0E\0R\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0S\0I\0T\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0C\0\xd3\0D\0I\0G\0O\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0N\0O\0M\0B\0R\0E\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0S\0E\0X\0O\0\t\0P\0E\0S\0O\0 \0(\0G\0r\0a\0m\0o\0s\0)\0\t\0T\0A\0L\0L\0A\0 \0(\0C\0e\0n\0t\0\xed\0m\0e\0t\0r\0o\0s\0)\0\t\0F\0E\0C\0H\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0H\0O\0R\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0P\0A\0R\0T\0O\0 \0A\0T\0E\0N\0D\0I\0D\0O\0 \0P\0O\0R\0\t\0T\0I\0E\0M\0P\0O\0 \0D\0E\0 \0G\0E\0S\0T\0A\0C\0I\0\xd3\0N\0\t\0N\0\xda\0M\0E\0R\0O\0 \0C\0O\0N\0S\0U\0L\0T\0A\0S\0 \0P\0R\0E\0N\0A\0T\0A\0L\0E\0S\0\t\0T\0I\0P\0O\0 \0P\0A

Answer:

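The raw vector you are getting is not corrupted; it is text encoded as UTF-16. The leading \xfe\xff in the error message is the big-endian byte order mark, and the \0 before each letter is the high byte of a two-byte character. rawToChar() fails because an R string cannot contain those zero bytes. You can convert it like this: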

library(stringi)

raw_vec <- as.vector(unlist(reporte_nacido))

decoded <- stri_encode(raw_vec, "UTF16")

decoded
#> [1] "NÚMERO CERTIFICADO\tDEPARTAMENTO\tMUNICIPIO\tAREA NACIMIENTO\tINSPECCION CORREGIMIENTO O CASERIO NACIMIENTO\tSITIO NACIMIENTO\tCÓDIGO INSTITUCIÓN\tNOMBRE INSTITUCIÓN\tSEXO\tPESO (Gramos)\tTALLA (Centímetros)\tFECHA NACIMIENTO\tHORA NACIMIENTO\tPARTO ATENDIDO POR\tTIEMPO DE GESTACIÓN\tNÚMERO CONSULTAS PRENATALES\tTIPO PA"
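If stringi is not available, base R's iconv() can do the same conversion, since it accepts a list of raw vectors. The sketch below is only an illustration under stated assumptions: decode_utf16be is a hypothetical helper name, and raw_vec is built here from a short made-up sample so the example runs on its own. It strips the FE FF byte order mark explicitly, because BOM handling differs between iconv implementations.

```r
# Stand-in for the downloaded raw vector: BOM (FE FF) followed by UTF-16BE bytes.
# The sample text is invented for illustration only.
sample_text <- "N\u00daMERO\tSEXO\n1\tM\n"
body <- iconv(sample_text, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]
raw_vec <- c(as.raw(c(0xfe, 0xff)), body)

# Hypothetical helper: drop a leading big-endian BOM if present,
# then decode the remaining bytes to UTF-8 with base iconv().
decode_utf16be <- function(x) {
  has_bom <- length(x) >= 2 && identical(x[1:2], as.raw(c(0xfe, 0xff)))
  if (has_bom) x <- x[-(1:2)]
  iconv(list(x), from = "UTF-16BE", to = "UTF-8")
}

decoded <- decode_utf16be(raw_vec)
```

With the real data, raw_vec would be as.vector(unlist(reporte_nacido)) as above, and decoded can then be passed on to the reading step.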

It appears to be tab-separated rather than csv format, so you probably want to read it like this:

read.table(text = decoded, sep = "\t", header = TRUE)
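The defaults of read.table() can get in the way with a file like this: header names containing spaces or parentheses are rewritten by check.names, and apostrophes or # characters in the data are treated as quotes and comments. A slightly more defensive call might look like the sketch below. The sample string stands in for decoded, and the data row in it is made up purely for illustration.

```r
# Made-up sample in the same shape as the decoded text (header row + one record).
decoded <- paste0(
  "N\u00daMERO CERTIFICADO\tPESO (Gramos)\tNOMBRE INSTITUCI\u00d3N\n",
  "1\t3200\tHOSPITAL O'HIGGINS #2\n"
)

births <- read.table(
  text = decoded,
  sep = "\t",
  header = TRUE,
  quote = "",               # do not treat ' or " as quote characters
  comment.char = "",        # do not treat # as the start of a comment
  check.names = FALSE,      # keep headers such as "PESO (Gramos)" unchanged
  stringsAsFactors = FALSE
)

names(births)
```

Columns can then be accessed with backticks or double brackets, for example births[["PESO (Gramos)"]].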