Home > Net >  removing new line characters in csv file from inside columns in unix
removing new line characters in csv file from inside columns in unix

Time:09-17

Sample input Data::

cat test.csv
234,aa,bb,cc,30,dd

22,cc,

ff,dd,

40,gg

pxy,aa,,cc,

40

,dd

Required output ::

cat test_new.csv
234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd

I tried various awk and sed commands, nothing seems to be working out.

eg:: awk 'NR == 1{ printf $0; next } { printf "%s%s", (/^[0-9] ,/? ORS : ""), $0 } END{ print "" }'

Above awk looks only for numeric characters with "," , but this will create extra line if numeric comes in a column with new line character.

Input file contains Control M characters, that I'm already removing using dos2unix

CodePudding user response:

Regarding "Input file contains Control M characters, that i'm already removing using dos2unix" - that's probably making your life harder. There's a good chance that the "newline"s within a line are just LFs while the record endings are CRLF and so by removing the CRs we can no longer simply set the awk RS to CRLF and then remove any LFs remaining in the record.

If your file was exported from Excel as you say in the comments under your question then, whether the fields are quoted or not, it'll use CRLF at the end of each record, and LF alone within each field which looks like this if you just don't run dos2unix on it:

$ cat -v test.csv
234,aa,bb,cc,30,dd^M
22,cc,
ff,dd,
40,gg^M
pxy,aa,,cc,
40
,dd^M

so then using any awk:

$ awk -v ORS= '{sub(/\r$/,"\n")} 1' test.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd

or, if you prefer, using GNU awk for multi-char RS:

$ awk -v RS='\r\n' '{gsub(/\n/,"")} 1' test.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd

For more information on parsing CSV with awk see What's the most robust way to efficiently parse CSV using awk?.

Regarding your original script - never use printf $0, always use printf "%s", $0 instead since the former will fail when $0 contains any printf formatting chars such as %s:

$ echo 'a%sb' | awk '{printf "%s", $0}'
a%sb$

$ echo 'a%sb' | awk '{printf $0}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
        `a%sb'
          ^ ran out for this one
$

That applies to using printf on all input data.

  • Related