Filtering on a condition using the column names and not numbers-CodePudding

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (as there are thousands and are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this but nothing is returned to the command line.

Here are a few things I have tried:

awk '($colname1==2 && $colname2==1) { count } END { print count }' file.txt to filter out the columns based on their conditions

and head -1 file.txt | tr '\t' | cat -n | grep "COLNAME to try and return the possible column number related to the column.

An example file would be:

ID  ad   bd

1   a   fire
2   b   air
3   c   water
4   c   water
5   d   water
6   c   earth

Output would be: 2 (count of ad=c and bd=water)

CodePudding user response：

with your input file and the implied conditions this should work

$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i  ) col[h[i]]=i} 
                             $col[c1]=="c" && $col[c2]=="water"{count  } END{print count 0}' file

2

or you can replace c1 and c2 with the values in the script as well.

to find the column indices you can run

$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i  ) colmap[c[i]]} 
                       NR==1{for(i=1;i<=NF;i  ) if($i in colmap) print $i,i; exit}' file

ad 2
bd 3

or perhaps with this chain

$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'

     2  ad
     3  bd

although may have false positives due to regex match...

You can rewrite the awk to be more succinct

$ awk -v cols='ad bd' '{while(  i<=NF) if(FS cols FS ~ FS $i FS) print $i,i; 
                        exit}' file

ad 2
bd 3

CodePudding user response：

As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:

awk -F'\t' '
    NR==1 {
        for (i=1; i<=NF; i  ) {
            f[$i] = i
        }
    }
    ($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt   }
    END { print cnt 0 }
' file
2

I'm assuming your input is tab-separated due to the tr '\t' in the command in your question that looks like you're trying to convert tabs to newlines to convert column names to numbers. If I'm wrong and they're just separated by any chains of white space then remove -F'\t' from the above.

CodePudding user response：

Use miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is removed using tail and the lines are counted with wc.

mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n  2 | wc -l

Prints:

SEE ALSO:

miller manual

Note that miller can be easily installed, for example, using conda, like so:

conda create --name miller miller