Home > Net >  How can I parse CSV files with quoted fields containing commas, in awk?
How can I parse CSV files with quoted fields containing commas, in awk?

Time:03-20

I have a big CSV field, and I use awk with the field separator set to a comma. However, some fields are quoted and contain a comma, and I'm facing this issue:

Original file:

Downloads $  cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"

I am trying this way:

Downloads $  cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd

Expecting result:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

CodePudding user response:

Include (escaped) quotes in your field separator flag, and add them to your output print fields:

testfile.csv | awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}'

output:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

CodePudding user response:

I would use a tool that is able to properly parse CSV, such as xsv. With it, the command would look like

$ xsv select 2-4 testfile.csv 
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"

or, if you really want every value quoted, with a second step:

$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

CodePudding user response:

If gawk or GNU awk is available, you can make use of FPAT, which matches the fields, instead of splitting on field separators.

awk -v FPAT='([^,] )|(\"[^\"] \")' -v OFS=, '{print $2, $3, $4}' testfile.csv

Result:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

The string ([^,] )|(\"[^\"] \") is a regex pattern which matches either of:

  • ([^,] ) ... matches a sequence of any characters other than a comma.
  • (\"[^\"] \") ... matches a string enclosed by double quotes (which may include commas in between).

The parentheses around the patterns are put for visual clarity purpose and the regex will work without them such as FPAT='[^,] |\"[^\"] \"' because the alternative | has lower precedence.

  • Related