Looping through list of IDs to count matches in two columns

This is going to be a complicated one to explain so bear with me.

I am running an all-vs-all blastp comparison of multiple proteins and want the number of shared proteins between each pair of genomes.

I have a large file of query IDs and sequence IDs, for example:

A       A       100
A       A       100
A       A       100
A       B       74
A       B       47
A       B       67
A       C       73
A       C       84
A       C       74
A       D       48
A       D       74
A       D       74
B       A       67
B       A       83
B       A       44
B       B       100

The file continues like that. I'd like to count the number of lines with a given ID in column 1 and a given ID in column 2 (for example, A in column 1 and B in column 2). I have found a way to do this for a single pair with awk:

awk -F, '$1=="A" && $2=="A"' file | wc -l

However, I have hundreds of genomes, and this would involve typing the awk command thousands of times to cover the different combinations. I added the IDs from column 1 to a text file and tried a loop to go through all the IDs for all possible combinations:

for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done

This is the output:

etc.

I'd like the output to be:

A       A       60
A       B       54
A       C       34
A       D       35

etc.

Any help would be appreciated.

CodePudding user response：

If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:

 awk -F, '{ a[$1 FS $2]++ } END { for (entry in a) { print entry, a[entry] } }' file

A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3

This is doing the following:

Increment the count in array a for the item whose key is the concatenation of the first two columns, separated by the field separator FS (a comma): { a[$1 FS $2]++ }
Once the file processing is done (END), loop through the array, naming each key entry: for (entry in a)
In the loop, print the key and its count: { print entry, a[entry] }
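
If you want tab-separated output like in your question, something along these lines might work. It is only a sketch, assuming the file really is comma-separated (as your -F, suggests). The names sample.csv, pair_counts and count_pair are made up for illustration. The order of a for (entry in a) loop is not defined in awk, so the result is piped through sort. The second function shows how a shell value can be handed to awk with -v. In your loop, $i sits inside single quotes, so the shell never expands it and awk reads it as a field reference instead.

```shell
# Small sample in the comma-separated layout assumed above (hypothetical file name).
cat > sample.csv <<'EOF'
A,A,100
A,A,100
A,B,74
A,B,47
A,B,67
B,A,67
B,B,100
EOF

# Count every (column 1, column 2) pair in a single pass over the file,
# print "id1<TAB>id2<TAB>count", and sort so the order is predictable.
pair_counts() {
  awk -F, '{ count[$1 "\t" $2]++ }
       END { for (pair in count) print pair "\t" count[pair] }' "$1" | LC_ALL=C sort
}

# Count one specific pair. The shell values are passed in with -v,
# because $variables inside the single-quoted awk program are not expanded by the shell.
count_pair() {
  awk -F, -v q="$1" -v s="$2" '$1 == q && $2 == s { n++ } END { print n + 0 }' "$3"
}

pair_counts sample.csv
count_pair A B sample.csv
```

The single-pass version reads the file once no matter how many IDs there are, so it is probably the better fit for hundreds of genomes. The count_pair loop would re-read the file for every combination.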