I have a text file generated using curl -s <URL> | lynx -stdin -dump -nomargins -width=9999
It looks something like this:
name1 yes yes        no  no
name2 yes yes 10  12 no  yes
name3 no  no  139 11 yes no
If I understand my file correctly, each line has the same width in characters, but where data is missing there are just spaces instead of NA or something usable.
I want to convert this into a tab-separated file, with NA (or nothing) where the data for a column is missing.
I know there is a way to do this in Excel using data import with fixed column widths, and an R solution with read.fwf(), but I would like to find a bash solution.
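So for the sample above, the output I am after would look something like this (using | to stand in for tabs):
name1|yes|yes|NA|NA|no|no
name2|yes|yes|10|12|no|yes
name3|no|no|139|11|yes|no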
Any help is appreciated.
CodePudding user response:
Assumptions:
- a GNU awk solution is acceptable
- data in all columns is left-justified
- column #1 of each line is a non-white-space character
- column data does not contain embedded spaces
To simulate the lynx output:
$ cat lynx.out
name1 yes yes        no  no
name2 yes yes 10  12 no  yes
name3 no  no  139 11 yes no
name4     yes     3  no
NOTE: The following answers require GNU awk 4.0 for:
- the 3rd argument to asorti()
- the 4th argument to split()
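To check whether your awk is GNU awk, and which version (assuming gawk is what is installed as awk):

awk --version | head -n1    # GNU awk prints something like "GNU Awk 5.1.0, ..."; other awks may not support --version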
Assuming the lynx output can be captured to an output file (lynx.out), one GNU awk idea requiring 2 passes of the input file:
awk '
BEGIN { OFS="\t" }
FNR==NR { gsub(/ *$/,"")                    # strip trailing spaces from line
          n=split($0,a," {1,}",seps)
          sum=0
          for (i=1; i<=n; i++) {
              colpos[sum+1]                 # keep track of all column start positions as indexes of the colpos[] array
              sum=sum + length($i seps[i])
          }
          next
        }
FNR==1  { n=asorti(colpos,start,"@ind_num_asc") }   # sort column positions in numerical order; start[] will contain the positions sorted numerically
        { sep=""
          for (i=1;i<=n;i++) {              # loop through the list of positions, extracting data via substr() calls
              val=substr($0,start[i], start[i+1]==0 ? length($0) : start[i+1]-start[i])
              gsub(/ *$/,"",val)            # strip trailing spaces from value
              if (! val) val="NA"
              printf "%s%s",sep,val
              sep=OFS
          }
          print ""
        }
' lynx.out lynx.out
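To sketch what the first pass collects (illustrative numbers only; the exact values depend on the real column widths in lynx.out): for the name2 line of the sample, split() returns the 7 fields in a[] and the run of spaces that follows each field in seps[], so the running sum of length($i seps[i]) lands on the start position of every field:

field ($i)   length($i seps[i])   start recorded in colpos[]
name2        6                    1
yes          4                    7
yes          4                    11
10           4                    15
12           3                    19
no           4                    22
yes          3                    26

The union of these start positions over all input lines, sorted by asorti(), gives the column boundaries used by the second pass.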
Same concept, but using a single pass through the data (input from a file or stdin) and assuming the entire lynx output can be saved in memory (via the lines[] array):
awk '
BEGIN { OFS="\t" }
{ sub(/ *$/,"")
lines[ c]=$0 # save current line in memory
n=split($0,a," {1,}",seps)
sum=0
for (i=1; i<=n; i ) {
colpos[sum 1]
sum= sum length($i seps[i])
}
}
END { n=asorti(colpos,start,"@ind_num_asc")
for (lineno=1;lineno<=c;lineno ) { # loop through array of input lines
sep=""
for (i=1;i<=n;i ) {
val=substr(lines[lineno],start[i], start[i 1]==0 ? length(lines[lineno]) : start[i 1]-start[i])
gsub(/ *$/,"",val)
if (! val) val="NA"
printf "%s%s",sep,val
sep=OFS
}
print ""
}
}
'
NOTES:
- invoke one of 2 ways:
cat lynx.out | awk '....'
awk '.....' lynx.out
Both of these generate:
name1 yes yes NA NA no no
name2 yes yes 10 12 no yes
name3 no no 139 11 yes no
name4 NA yes NA 3 no NA
Modifying the code to replace "NA" with "" generates:
name1 yes yes no no
name2 yes yes 10 12 no yes
name3 no no 139 11 yes no
name4 yes 3 no
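That change is just the one assignment in either script:

if (! val) val=""          # instead of: if (! val) val="NA"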
CodePudding user response:
Edit: Updated the heuristic for determining the input fields
This standard awk program should work in most cases:
awk -F'^$' -v OFS='\t' '
nc < ( n = length() ) {                   # longest line so far: initialize map[] for the new positions
    for ( i = nc + 1; i <= n; i++ )
        map[i] = 0
    nc = n
}
{
    records[NR] = $0                      # save line; mark every position that holds a non-space character
    for ( i = 1; i <= nc; i++ )
        if ( substr($0, i, 1) != " " )
            map[i] = 1
}
END {
    for ( i = 1; i <= nc; i++ ) {         # contiguous runs of marked positions define the fields (start, width)
        if ( map[i] ) {
            if ( !map[i-1] ) {
                nf++
                rstart[nf] = i
            }
            rlength[nf] += 1
        }
    }
    for ( n = 1; n <= NR; n++ ) {         # extract and trim each field; assigning $i rebuilds $0 with OFS (tab)
        for ( i = 1; i <= nf; i++ ) {
            $i = substr( records[n], rstart[i], rlength[i] )
            gsub( /^ +| +$/, "", $i )
        }
        print
    }
}
'
Example:
echo "\
name1 yes yes        no
name2 yes     10  12     yes
          no  139 11 yes no " |
awk ... |
tr '\t' '|' # for showing the fields more clearly
name1|yes|yes|||no|
name2|yes||10|12||yes
||no|139|11|yes|no
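To make the heuristic concrete (a rough sketch based on the example above): map[i] is set for every character position that is non-blank in at least one input line, and each contiguous run of set positions becomes one output field (rstart[] / rlength[]). For the three lines above, the combined map looks roughly like this (1 = some line has data at that position):

11111 111 111 111 11 111 111

Seven runs of 1s, hence the seven |-separated fields in every output line.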
CodePudding user response:
Here is a multi-pass Perl option. Optimisation is surely possible.
The basic idea is to find the offsets where we never have field data and put tabs there.
Sample input to lynx:
<html>
<body>
<table>
<tr><td>name1</td><td>yes</td><td>maybe</td><td></td><td></td><td>no</td><td>no</td></tr>
<tr><td>name2</td><td>yes</td><td>yes</td><td>10</td><td>12</td><td>no</td><td>yes</td></tr>
<tr><td>name4</td><td></td><td>maybe not</td><td></td><td></td><td></td><td></td></tr>
<tr><td>name3</td><td>no</td><td>no</td><td>139</td><td>11</td><td>yes</td><td>no</td></tr>
</table>
</body>
</html>
Sample output from lynx -stdin -dump -nomargins -width=9999:
name1 yes maybe            no  no
name2 yes yes       10  12 no  yes
name4     maybe not
name3 no  no        139 11 yes no
Process with Perl. Offsets with zero count can become tabs.
$ <lynx.out perl -nle '
$cols[ $-[0] ]++ while /[^ ]/g;
print $_, "\n", map {$_||"-"} @cols;
END {
@offsets = grep { !$cols[$_] } 0..$#cols;
print "\n", "@offsets";
}
'
name1 yes maybe            no  no
11111-111-11111------------11--11
name2 yes yes       10  12 no  yes
22222-222-22211-----11--11-22--221
name4     maybe not
33333-222-33322-111-11--11-22--221
name3 no  no        139 11 yes no
44444-332-44322-111-221-22-331-331
5 9 15 19 23 26 30
We can perform tab substitution with a little adjustment to the code (using | here instead of actual tabs for clarity):
$ <lynx.out perl -nle '
push @lines, $_;
$cols[ $-[0] ]++ while /[^ ]/g;
END {
@offsets = grep { !$cols[$_] } 0..$#cols;
for (@lines) {
# pad short lines
$line = $_ . " " x (@cols-length);
substr($line,$_,1) = "|" for @offsets;
print $line;
}
}
'
name1|yes|maybe|   |   |  |no |no
name2|yes|yes  |   |10 |12|no |yes
name4|   |maybe|not|   |  |   |
name3|no |no   |   |139|11|yes|no
An extra column has appeared. This is unavoidable when the input is ambiguous (as in this example). If cells cannot contain spaces, this will not be an issue.
Finally, strip the leading and trailing spaces around the "tabs", for example by adding this before the final print:
$line =~ s/ *\| */|/g;
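Putting the pieces together (a sketch, untested: the same script as above, but writing real tab characters and applying the strip, so the result is an actual tab-separated file):

$ <lynx.out perl -nle '
  push @lines, $_;
  $cols[ $-[0] ]++ while /[^ ]/g;
  END {
    @offsets = grep { !$cols[$_] } 0..$#cols;
    for (@lines) {
      $line = $_ . " " x (@cols-length);       # pad short lines
      substr($line,$_,1) = "\t" for @offsets;  # real tabs at the always-blank offsets
      $line =~ s/ *\t */\t/g;                  # strip the padding around the tabs
      print $line;
    }
  }
'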