Given a TSV file in which col2 can contain an embedded field or record separator (FS/RS), i.e. a tab or a newline respectively, escaped by being surrounded with quotes.
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
------ --------- ------
| col1 | col2    | col3 |
------ --------- ------
| 1    | "A B"   | 1234 |
| 2    | "CD     | 567  |
|      | EF"     |      |
------ --------- ------
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
------ --------- ------
| col1 | col2    | col3 |
------ --------- ------
| 1    | "A B"   | 1234 |
| 2    | "CD EF" | 567  |
------ --------- ------
I cannot get Miller 6.2 to make the proper transformation (tried with the DSL put/gsub) because it doesn't recognize the tab or CR/LF as part of the column value, which breaks the field count:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr: mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.
CodePudding user response:
If you run

printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace

you get

col1   col2    col3
1      A B     1234
2      CD EF   567

I'm using mlr 6.2.
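To also get the barred table the question asks for, the same idea can feed Miller's pretty-printer; a sketch, assuming Miller 6's --icsv/--ifs/--opprint/--barred flags (note the CSV reader consumes the quotes, so col2 comes out unquoted):

```shell
# Read the tab-separated input as CSV (so quoted embedded tabs/newlines parse),
# collapse whitespace inside values, then pretty-print with borders.
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' |
mlr --icsv --ifs "\t" --opprint --barred clean-whitespace
```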
CodePudding user response:
perl -MText::CSV_XS=csv -e'
   csv
      in       => *ARGV,
      on_in    => sub { s/\s+/ /g for @{$_[1]} },
      sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer. Can be placed all on one line. Input is accepted from a file named by argument, or from STDIN.
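Spelled out as a single line against the question's sample data (assuming Text::CSV_XS is installed), that might look like:

```shell
# Parse tab-separated input with Text::CSV_XS's csv() function, squeezing
# runs of whitespace (including the embedded tab/newline) inside each field.
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' |
perl -MText::CSV_XS=csv -e'csv in => *ARGV, on_in => sub { s/\s+/ /g for @{$_[1]} }, sep_char => "\t";'
```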
CodePudding user response:
A good library nicely handles things like embedded quotes and newlines, and offers flexibility.
In a Perl script with Text::CSV:
use warnings;
use strict;
use Text::CSV;

my $file = shift // die "Usage: $0 filename\n";

my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );

open my $fh, '<', $file or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) {
    s/\s+/ /g for @$row;    # collapse runs of spaces, tabs, newlines
    $csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities. This can be fit into a one-liner, and the module's functional interface (the csv function) is particularly suited for that.
CodePudding user response:
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1    col2    col3
1       "A B"   1234
2       "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.
CodePudding user response:
You absolutely don't need gawk to get this done - here's one that works for mawk, gawk, or macOS nawk:
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"")(FS)(ORS = _) \
         (RS = "^$"), _ += _^= _<_, _)
}
END {
    printbefore()
    for (_^=_<_; _<=NF; _++) {
        sub(/[\t-\r]+/, ($_ ~ __) ? " " : "&", $_)
    }
    print
}
function printbefore(_)
{
    printf("\n\n--before--\n%s\n------" \
           "------AFTER------\n\n", $_) > ("/dev/stderr")
}'
OUTPUT
------AFTER (using mawk)------
col1    col2    col3
1       "A B"   1234
2       "CD EF" 567
Strip out the part about printbefore() (that's more for debugging purposes), then it's just:
{m,n,g}awk '
BEGIN { __ = substr((OFS = FS = "\t\"") FS \
        (ORS = _) (RS = "^$"), _ += _^= _<_, _)
} END {
    for (--_; _<=NF; _++) {
        sub(/[\t-\r]+/, $_ ~ __ ? " " : "&", $_) } print }'