Sample input Data::
cat test.csv
234,aa,bb,cc,30,dd
22,cc,
ff,dd,
40,gg
pxy,aa,,cc,
40
,dd
Required output ::
cat test_new.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
I tried various awk
and sed
commands, nothing seems to be working out.
eg:: awk 'NR == 1{ printf $0; next } { printf "%s%s", (/^[0-9] ,/? ORS : ""), $0 } END{ print "" }'
Above awk
looks only for numeric characters with "," , but this will create extra line if numeric comes in a column with new line character.
Input file contains Control M characters, that I'm already removing using dos2unix
CodePudding user response:
Regarding "Input file contains Control M characters, that i'm already removing using dos2unix" - that's probably making your life harder. There's a good chance that the "newline"s within a line are just LF
s while the record endings are CRLF
and so by removing the CR
s we can no longer simply set the awk RS
to CRLF
and then remove any LF
s remaining in the record.
If your file was exported from Excel as you say in the comments under your question then, whether the fields are quoted or not, it'll use CRLF
at the end of each record, and LF
alone within each field which looks like this if you just don't run dos2unix
on it:
$ cat -v test.csv
234,aa,bb,cc,30,dd^M
22,cc,
ff,dd,
40,gg^M
pxy,aa,,cc,
40
,dd^M
so then using any awk:
$ awk -v ORS= '{sub(/\r$/,"\n")} 1' test.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
or, if you prefer, using GNU awk for multi-char RS:
$ awk -v RS='\r\n' '{gsub(/\n/,"")} 1' test.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
For more information on parsing CSV with awk see What's the most robust way to efficiently parse CSV using awk?.
Regarding your original script - never use printf $0
, always use printf "%s", $0
instead since the former will fail when $0
contains any printf formatting chars such as %s
:
$ echo 'a%sb' | awk '{printf "%s", $0}'
a%sb$
$ echo 'a%sb' | awk '{printf $0}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
`a%sb'
^ ran out for this one
$
That applies to using printf
on all input data.