I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (as there are thousands and are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be: 2 (count of ad=c and bd=water)
CodePudding user response:
with your input file and the implied conditions this should work
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i ) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count } END{print count 0}' file
2
or you can replace c1 and c2 with the values in the script as well.
to find the column indices you can run
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i ) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i ) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although may have false positives due to regex match...
You can rewrite the awk
to be more succinct
$ awk -v cols='ad bd' '{while( i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
CodePudding user response:
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
for (i=1; i<=NF; i ) {
f[$i] = i
}
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt }
END { print cnt 0 }
' file
2
I'm assuming your input is tab-separated due to the tr '\t'
in the command in your question that looks like you're trying to convert tabs to newlines to convert column names to numbers. If I'm wrong and they're just separated by any chains of white space then remove -F'\t'
from the above.
CodePudding user response:
Use miller
toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (delimiter is specified using --tsv
) and writes the results to STDOUT together with the header. The header is removed using tail
and the lines are counted with wc
.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n 2 | wc -l
Prints:
2
SEE ALSO:
Note that miller
can be easily installed, for example, using conda
, like so:
conda create --name miller miller