Looping through a list of IDs to count matches in two columns

Time:06-07

This is going to be a complicated one to explain so bear with me.

I am running an all-vs-all blastp comparison of multiple proteins and want the number of shared proteins between the genomes.

I have a large file containing the query ID and the sequence ID, for example:

A       A       100
A       A       100
A       A       100
A       B       74
A       B       47
A       B       67
A       C       73
A       C       84
A       C       74
A       D       48
A       D       74
A       D       74
B       A       67
B       A       83
B       A       44
B       B       100

The file continues like that. I'd like to count the number of lines with a given ID in column 1 and a given ID in column 2, for example A in column 1 and B in column 2. I have found a way to do this for a single pair with awk:

awk -F, '$1=="A" && $2=="A"' file | wc -l 

However, I have hundreds of genomes, and this would mean typing the awk command thousands of times to cover the different combinations. I added the IDs from column 1 to a text file and tried a loop to go through all the IDs for all possible combinations:

for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done

This is the output:

0
0
0
0
0
0
0

etc.

I'd like the output to be:

A       A       60
A       B       54
A       C       34
A       D       35

etc.

Any help would be appreciated.

CodePudding user response:

If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:

awk -F, '{ a[$1 FS $2]++ } END { for (entry in a) { print entry, a[entry] } }' file

A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3

This is doing the following:

  1. Increment the count in array a for the key built by concatenating the first two columns, separated by the field separator FS (a comma): { a[$1 FS $2]++ }
  2. Once the whole file has been processed (END), loop through the array, calling each key entry: for (entry in a)
  3. In the loop, print the key and its count: { print entry, a[entry] }
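
As for why the original loop printed only zeros: inside single quotes the shell never expands $i, so awk sees its own, unset variable i. In awk, $i with an unset i means $0 (the whole line), and $1 is never equal to the whole line when the line has several fields. Shell values can be passed in with -v instead. Below is a rough sketch of both approaches, assuming the input really is comma-separated (as the -F, in the question suggests). The function names count_pairs and count_one are made up for illustration, and the output is tab-separated to resemble the layout asked for in the question.

```shell
# Hypothetical helper: count every (column 1, column 2) pair in a single pass.
# Assumes comma-separated input: query ID, sequence ID, third column ignored.
count_pairs() {
    awk -F, -v OFS='\t' '
        { count[$1 FS $2]++ }              # one counter per pair of IDs
        END {
            for (pair in count) {
                split(pair, id, FS)        # recover the two IDs from the key
                print id[1], id[2], count[pair]
            }
        }
    ' "$1" | sort                          # for-in order is unspecified, so sort
}

# Hypothetical helper: count a single pair, passing the shell values in with -v
# rather than trying to expand them inside the single-quoted awk program.
count_one() {
    awk -F, -v q="$2" -v s="$3" '$1 == q && $2 == s { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_pairs file.csv
#   count_one file.csv A B
```

The single-pass version reads the file once no matter how many IDs there are, so it is probably the better fit for hundreds of genomes. The -v version is only there to show how the loop could be repaired if a per-pair lookup is needed.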