Given a TSV file in which col2 can contain an embedded field or record separator (FS/RS), i.e. a tab or a newline respectively, escaped by being surrounded with quotes.
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
------ --------- ------
| col1 | col2    | col3 |
------ --------- ------
| 1    | "A B"   | 1234 |
| 2    | "CD     | 567  |
|      | EF"     |      |
------ --------- ------
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
------ --------- ------
| col1 | col2    | col3 |
------ --------- ------
| 1    | "A B"   | 1234 |
| 2    | "CD EF" | 567  |
------ --------- ------
I cannot get Miller 6.2 to make the proper transformation (tried with the DSL put/gsub) because it doesn't recognize the tab or CR/LF as part of the column value, which breaks the field count:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr: mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.
CodePudding user response:
If you run

printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace

you get

col1   col2    col3
1      A B     1234
2      CD EF   567

I'm using mlr 6.2.
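To also get the barred table the question asks for, the same idea can feed Miller's pretty-printer; a sketch, assuming Miller 6's --icsv/--ifs/--opprint/--barred flags (note the CSV reader consumes the quotes, so col2 comes out unquoted):

```shell
# Read the tab-separated input as CSV (so quoted embedded tabs/newlines parse),
# collapse whitespace inside values, then pretty-print with borders.
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' |
mlr --icsv --ifs "\t" --opprint --barred clean-whitespace
```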
CodePudding user response:
perl -MText::CSV_XS=csv -e'
   csv
      in       => *ARGV,
      on_in    => sub { s/\s+/ /g for @{$_[1]} },
      sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer. Can be placed all on one line. Input is accepted from a file named by argument, or from STDIN.
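Spelled out as a single line against the question's sample data (assuming Text::CSV_XS is installed), that might look like:

```shell
# Parse tab-separated input with Text::CSV_XS's csv() function, squeezing
# runs of whitespace (including the embedded tab/newline) inside each field.
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' |
perl -MText::CSV_XS=csv -e'csv in => *ARGV, on_in => sub { s/\s+/ /g for @{$_[1]} }, sep_char => "\t";'
```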
CodePudding user response:
A good library nicely handles things like embedded quotes and newlines, and offers flexibility.
In a Perl script with Text::CSV:
use warnings;
use strict;
use Text::CSV;

my $file = shift // die "Usage: $0 filename\n";

my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );

open my $fh, '<', $file or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) {
    s/\s+/ /g for @$row;    # collapse runs of spaces, tabs, newlines
    $csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities. This can be fit into a one-liner, and the module's functional interface (the csv function) is particularly suited for that.
CodePudding user response:
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1    col2    col3
1       "A B"   1234
2       "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.
CodePudding user response:
You absolutely don't need gawk to get this done - here's one that works for mawk, gawk, or macOS nawk:
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"")(FS)(ORS = _) \
         (RS = "^$"), _ += _^= _<_, _)
}
END {
    printbefore()
    for (_^=_<_; _<=NF; _++) {
        sub(/[\t-\r]+/, ($_ ~ __) ? " " : "&", $_)
    }
    print
}
function printbefore(_)
{
    printf("\n\n--before--\n%s\n------" \
           "------AFTER------\n\n", $_) > ("/dev/stderr")
}'
OUTPUT
------AFTER (using mawk)------
col1    col2    col3
1       "A B"   1234
2       "CD EF" 567
Strip out the part about printbefore() (that's more for debugging purposes), then it's just:
{m,n,g}awk '
BEGIN { __ = substr((OFS = FS = "\t\"") FS \
        (ORS = _) (RS = "^$"), _ += _^= _<_, _)
} END {
    for (--_; _<=NF; _++) {
        sub(/[\t-\r]+/, $_ ~ __ ? " " : "&", $_) } print }'