I have a text file generated using curl -s <URL> | lynx -stdin -dump -nomargins -width=9999
It looks something like this:
name1 yes yes        no  no
name2 yes yes 10  12 no  yes
name3 no  no  139 11 yes no
If I understand my file correctly, each line has the same width in characters, but where data is missing there are just spaces instead of NA or something usable.
I want to convert this into a tab-separated file, with NA (or nothing) where the data for a column is missing.
I know there is a way to do this in Excel using data import with fixed column widths, and an R solution with read.fwf(), but I would like to find a bash solution.
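So for the sample above, the output I am after would look something like this (using | to stand in for tabs):
name1|yes|yes|NA|NA|no|no
name2|yes|yes|10|12|no|yes
name3|no|no|139|11|yes|no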
Any help is appreciated.
CodePudding user response:
Assumptions:
- a GNU awk solution is acceptable
- data in all columns is left-justified
- column #1 of each line is a non-white-space character
- column data does not contain embedded spaces
To simulate the lynx output:
$ cat lynx.out
name1 yes yes        no  no
name2 yes yes 10  12 no  yes
name3 no  no  139 11 yes no
name4     yes     3  no
NOTE: The following answers require GNU awk 4.0 for:
- the 3rd argument to asorti()
- the 4th argument to split()
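To check whether your awk is GNU awk, and which version (assuming gawk is what is installed as awk):

awk --version | head -n1    # GNU awk prints something like "GNU Awk 5.1.0, ..."; other awks may not support --version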
Assuming the lynx output can be captured to an output file (lynx.out), one GNU awk idea requiring 2 passes of the input file:
awk '
BEGIN { OFS="\t" }
FNR==NR { gsub(/ *$/,"")                    # strip trailing spaces from line
          n=split($0,a," {1,}",seps)
          sum=0
          for (i=1; i<=n; i++) {
              colpos[sum+1]                 # keep track of all column start positions as indexes of the colpos[] array
              sum=sum + length($i seps[i])
          }
          next
        }
FNR==1  { n=asorti(colpos,start,"@ind_num_asc") }   # sort column positions in numerical order; start[] will contain the positions sorted numerically
        { sep=""
          for (i=1;i<=n;i++) {              # loop through the list of positions, extracting data via substr() calls
              val=substr($0,start[i], start[i+1]==0 ? length($0) : start[i+1]-start[i])
              gsub(/ *$/,"",val)            # strip trailing spaces from value
              if (! val) val="NA"
              printf "%s%s",sep,val
              sep=OFS
          }
          print ""
        }
' lynx.out lynx.out
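To sketch what the first pass collects (illustrative numbers only; the exact values depend on the real column widths in lynx.out): for the name2 line of the sample, split() returns the 7 fields in a[] and the run of spaces that follows each field in seps[], so the running sum of length($i seps[i]) lands on the start position of every field:

field ($i)   length($i seps[i])   start recorded in colpos[]
name2        6                    1
yes          4                    7
yes          4                    11
10           4                    15
12           3                    19
no           4                    22
yes          3                    26

The union of these start positions over all input lines, sorted by asorti(), gives the column boundaries used by the second pass.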
Same concept, but using a single pass through the data (input from a file or stdin) and assuming the entire lynx output can be saved in memory (via the lines[] array):
awk '
BEGIN { OFS="\t" }
{ sub(/ *$/,"")
lines[ c]=$0 # save current line in memory
n=split($0,a," {1,}",seps)
sum=0
for (i=1; i<=n; i ) {
colpos[sum 1]
sum= sum length($i seps[i])
}
}
END { n=asorti(colpos,start,"@ind_num_asc")
for (lineno=1;lineno<=c;lineno ) { # loop through array of input lines
sep=""
for (i=1;i<=n;i ) {
val=substr(lines[lineno],start[i], start[i 1]==0 ? length(lines[lineno]) : start[i 1]-start[i])
gsub(/ *$/,"",val)
if (! val) val="NA"
printf "%s%s",sep,val
sep=OFS
}
print ""
}
}
'
NOTES:
- invoke one of 2 ways:
cat lynx.out | awk '....'
awk '.....' lynx.out
Both of these generate:
name1 yes yes NA NA no no
name2 yes yes 10 12 no yes
name3 no no 139 11 yes no
name4 NA yes NA 3 no NA
Modifying the code to replace "NA" with "" generates:
name1 yes yes no no
name2 yes yes 10 12 no yes
name3 no no 139 11 yes no
name4 yes 3 no
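That change is just the one assignment in either script:

if (! val) val=""          # instead of: if (! val) val="NA"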
CodePudding user response:
Edit: Updated the heuristic for determining the input fields
This standard awk program should work in most cases:
awk -F'^$' -v OFS='\t' '
nc < ( n = length() ) {                   # longest line so far: initialize map[] for the new positions
    for ( i = nc + 1; i <= n; i++ )
        map[i] = 0
    nc = n
}
{
    records[NR] = $0                      # save line; mark every position that holds a non-space character
    for ( i = 1; i <= nc; i++ )
        if ( substr($0, i, 1) != " " )
            map[i] = 1
}
END {
    for ( i = 1; i <= nc; i++ ) {         # contiguous runs of marked positions define the fields (start, width)
        if ( map[i] ) {
            if ( !map[i-1] ) {
                nf++
                rstart[nf] = i
            }
            rlength[nf] += 1
        }
    }
    for ( n = 1; n <= NR; n++ ) {         # extract and trim each field; assigning $i rebuilds $0 with OFS (tab)
        for ( i = 1; i <= nf; i++ ) {
            $i = substr( records[n], rstart[i], rlength[i] )
            gsub( /^ +| +$/, "", $i )
        }
        print
    }
}
'
Example:
echo "\
name1 yes yes        no
name2 yes     10  12     yes
          no  139 11 yes no " |
awk ... |
tr '\t' '|' # for showing the fields more clearly
name1|yes|yes|||no|
name2|yes||10|12||yes
||no|139|11|yes|no
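To make the heuristic concrete (a rough sketch based on the example above): map[i] is set for every character position that is non-blank in at least one input line, and each contiguous run of set positions becomes one output field (rstart[] / rlength[]). For the three lines above, the combined map looks roughly like this (1 = some line has data at that position):

11111 111 111 111 11 111 111

Seven runs of 1s, hence the seven |-separated fields in every output line.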
CodePudding user response:
Here is a multi-pass Perl option. Optimisation is surely possible.
The basic idea is to find the offsets where we never have field data and put tabs there.
Sample input to lynx:
<html>
<body>
<table>
<tr><td>name1</td><td>yes</td><td>maybe</td><td></td><td></td><td>no</td><td>no</td></tr>
<tr><td>name2</td><td>yes</td><td>yes</td><td>10</td><td>12</td><td>no</td><td>yes</td></tr>
<tr><td>name4</td><td></td><td>maybe not</td><td></td><td></td><td></td><td></td></tr>
<tr><td>name3</td><td>no</td><td>no</td><td>139</td><td>11</td><td>yes</td><td>no</td></tr>
</table>
</body>
</html>
Sample output from lynx -stdin -dump -nomargins -width=9999:
name1 yes maybe            no  no
name2 yes yes       10  12 no  yes
name4     maybe not
name3 no  no        139 11 yes no
Process with Perl. Offsets with zero count can become tabs.
$ <lynx.out perl -nle '
$cols[ $-[0] ]++ while /[^ ]/g;
print $_, "\n", map {$_||"-"} @cols;
END {
@offsets = grep { !$cols[$_] } 0..$#cols;
print "\n", "@offsets";
}
'
name1 yes maybe            no  no
11111-111-11111------------11--11
name2 yes yes       10  12 no  yes
22222-222-22211-----11--11-22--221
name4     maybe not
33333-222-33322-111-11--11-22--221
name3 no  no        139 11 yes no
44444-332-44322-111-221-22-331-331
5 9 15 19 23 26 30
We can perform tab substitution with a little adjustment to the code (using | here instead of actual tabs for clarity):
$ <lynx.out perl -nle '
push @lines, $_;
$cols[ $-[0] ]++ while /[^ ]/g;
END {
@offsets = grep { !$cols[$_] } 0..$#cols;
for (@lines) {
# pad short lines
$line = $_ . " " x (@cols-length);
substr($line,$_,1) = "|" for @offsets;
print $line;
}
}
'
name1|yes|maybe|   |   |  |no |no
name2|yes|yes  |   |10 |12|no |yes
name4|   |maybe|not|   |  |   |
name3|no |no   |   |139|11|yes|no
An extra column has appeared. This is unavoidable when the input is ambiguous (as in this example). If cells cannot contain spaces, this will not be an issue.
Finally, strip the leading and trailing spaces around the "tabs", for example by adding this before the final print:
$line =~ s/ *\| */|/g;
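Putting the pieces together (a sketch, untested: the same script as above, but writing real tab characters and applying the strip, so the result is an actual tab-separated file):

$ <lynx.out perl -nle '
  push @lines, $_;
  $cols[ $-[0] ]++ while /[^ ]/g;
  END {
    @offsets = grep { !$cols[$_] } 0..$#cols;
    for (@lines) {
      $line = $_ . " " x (@cols-length);       # pad short lines
      substr($line,$_,1) = "\t" for @offsets;  # real tabs at the always-blank offsets
      $line =~ s/ *\t */\t/g;                  # strip the padding around the tabs
      print $line;
    }
  }
'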