Home > Back-end >  Reading csv flie with commas in R that stops fread function
Reading csv flie with commas in R that stops fread function

Time:04-01

I am trying to read multiple csv files (like 300) with the function fread in R.

When i open one of the csv files in excel, the columns are delimited correctly, even when some observations contain commas.

When I try to read one of the files, the fuction does't read all the observations in the file and the next error appears

> file_prueba<-fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", skip = 5, header = TRUE)

Warning message:
In fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv",  :
  Stopped early on line 1073. Expected 17 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA","">>

Therefore i can't read the whole file. I suspect it is because one of the observations cointains commas like "PLOMERIA, TUBO DE COBRE, DE 60 MTS". But I'm not sure.

How can i fix this without fixing each csv file one by one?

Here's the file that i'm using int he example, but as I said, i need to read multiple files like this: https://drive.google.com/file/d/1gSjyL14sZQC5KNtMXhN_iN79xCETTZAG/view?usp=sharing

CodePudding user response:

The file is corrupt in two ways: lines 1073 and 3401 have embedded quotes. But there's another problem here ... read down to the second section fread and double-double-quotes for the problem with fread.

(Ultimately, this is a failure of the exporting process and a failure of fread to read embedded double quotes.)

Corrupted lines

Scroll right to see the problems.

  • Line 1073:

    "2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA",""
        --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                     ^-- this quote is incorrect
    
  • Line 3401:

    "2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76","1","PZA",""
            --->                        --->                        --->                        --->                        --->                        --->                        --->                        --->                        --->                        --->        ^-- this quote is incorrect
    

The best fix is to get whatever person/process exported this to export compliant CSV.

Here is a command-line (sed) fix that will allow fread to load it without warning or error (this is on a shell prompt, not in R).

sed -i \
  -e 's/", PZA/"", PZA/g' \
  -e s'/BARRA DE 1\/2"/BARRA DE 1\/2""/g' \
  "INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"

Simple explanation: the CSV standard (well-framed at https://en.wikipedia.org/wiki/Comma-separated_values) suggests that either double-quotes should never be in a quoted field, or if present they should be doubled (as in "" to produce a single " in the middle of a value).

In this case, it finds the two very specific failing text and adds the second quote.

  • -i means to make the change in-place; perhaps a more defensive use would be to do sed -e 's/../../g' -e 's/../../g' < oldfile.csv > newfile.csv, which would preserve the broken file. Over to you.
  • -e adds a sed script/command, multiple commands can be given.
  • s/from/to/g means to replace the pattern from with the string in to; the g means "global".

This changes the two lines (shown one after the other here for simplicity:

"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS","231.55","1","PZA",""
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2"" X 6 MT","316.76","1","PZA",""
    --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                --->                     ^^^^^-- the changes, double-double quotes

FYI: if you don't have sed in the path ... if you're running windows, then look in the RTools40 path; for me, I have c:/rtools40/usr/bin/sed.exe. If you're on macos or linux and cannot find sed, well ... that's odd.

After that sed command executes correctly, it will load without problem. HOWEVER, don't let this mislead you ... it is not really fixed. Keep reading.

csv <- fread("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv
#         Año   Mes             Fecha_Pub_DOF Clave ciudad                 Nombre ciudad    División
#       <int> <int>                    <char>        <int>                        <char>      <char>
#    1:  2020     6 20/07/2020 12:00:00 a. m.            1 Área Met. de la Cd. de México 3. Vivienda
#    2:  2020     6 20/07/2020 12:00:00 a. m.            1 Área Met. de la Cd. de México 3. Vivienda
#    3:  2020     6 20/07/2020 12:00:00 a. m.            1 Área Met. de la Cd. de México 3. Vivienda
...snip...
# 11 variables not shown: [Grupo <char>, Clase <char>, Subclase <char>, Clave genérico <int>, Genérico <char>, Consecutivo <int>, Especificación <char>, Precio promedio <num>, Cantidad <int>, Unidad <char>, ...]

fread and double-double-quotes

The problem with the above is that while it seems to have worked correctly, it (still) does not do embedded quotes correctly. As long as you want your data to have all of the embedded quotes that you want, then you cannot use fread, unfortunately.

Why?

str(csv[1067,])
# Classes 'data.table' and 'data.frame':    1 obs. of  17 variables:
#  $ Año            : int 2020
#  $ Mes            : int 6
#  $ Fecha_Pub_DOF  : chr "20/07/2020 12:00:00 a. m."
#  $ Clave ciudad   : int 12
#  $ Nombre ciudad  : chr "San Luis Potosí, S.L.P."
#  $ División       : chr "3. Vivienda"
#  $ Grupo          : chr "3.1. Costo de uso de vivienda"
#  $ Clase          : chr "3.1.1. Costo de uso de vivienda"
#  $ Subclase       : chr "42 Vivienda propia"
#  $ Clave genérico : int 140
#  $ Genérico       : chr "Productos para reparación menor de la vivienda"
#  $ Consecutivo    : int 1
#  $ Especificación : chr "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
#  $ Precio promedio: num 232
#  $ Cantidad       : int 1
#  $ Unidad         : chr "PZA"
#  $ Estatus        : chr ""
#  - attr(*, ".internal.selfref")=<externalptr> 

Namely, see

csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
                                          ^^^^ should only be a single "

Fortunately, read.csv works fine here:

csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\", PZA 6 MTS"

FYI, if you don't care about the embedded quotes, you can still use fread if you change the sed expressions to remove the double-quotes instead of doubling the double-quotes. That is, -e 's/", PZA/, PZA/g' and likewise for the second expression. I didn't recommend this first because it changes your data, which you should not have to do.

CodePudding user response:

The file you linked is properly quoted. It has 5 lines of non-CSV data though, so skip these:

csv = read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", header = T, skip = 5, fileEncoding = "Latin1")

This works fine for me. I am not so familiar with fread, and it does seem to have a problem with this file. Is there a reason you need data.table::fread for this?

  • Related