This is going to be a complicated one to explain so bear with me.
I am running an all-vs-all blastp comparison of multiple proteins and want the number of shared proteins between the genomes.
I have a large file of the query id and sequence id, example:
A A 100
A A 100
A A 100
A B 74
A B 47
A B 67
A C 73
A C 84
A C 74
A D 48
A D 74
A D 74
B A 67
B A 83
B A 44
B B 100
The file continues like that. I'd like to count the number of occurrences of A in column 1 and B in column 2. I have found a way to do this with awk:
awk -F, '$1=="A" && $2=="A"' file | wc -l
However, I have hundreds of genomes, and this would involve typing the awk command thousands of times to cover the different combinations. I added the IDs from column 1 to a text file and tried to loop through all the IDs to get every possible combination:
for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done
This is the output:
0
0
0
0
0
0
0
etc.
I'd like the output to be:
A A 60
A B 54
A C 34
A D 35
etc.
Any help would be appreciated.
CodePudding user response:
If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:
awk -F, '{ a[$1 FS $2]++ } END{for(entry in a){print entry, a[entry]}}' file
A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3
This is doing the following:
- Increment the count in array a for the item whose key is built from the first two columns, joined by the field separator FS (comma): { a[$1 FS $2]++ }
- Once the file processing is done (END), loop through the array, calling each key entry: for (entry in a)
- In the loop, print the key/entry and the value: { print entry, a[entry] }
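To get the space-separated layout asked for in the question, the same idea can be sketched as below. This is only a sketch under some assumptions: the sample rows are made up to mirror the question's layout, and the names sample, count, pair, q and s are hypothetical. The second command shows one way to check a single pair from a shell loop. In the original loop, $i sits inside single quotes, so the shell never expands it and awk reads it as a field reference, which is why every count came out as 0. Passing the values in with -v avoids that.

```shell
# Hypothetical sample data in the same comma-separated layout as the question.
sample=$(mktemp)
printf '%s\n' A,A,100 A,A,100 A,A,100 A,B,74 A,B,47 A,B,67 B,A,67 B,B,100 > "$sample"

# One pass over the file: count every (column 1, column 2) pair and
# print "query subject count". The output is piped through sort because
# the order of awk's for-in loop is unspecified.
all_pairs=$(awk -F, '{ count[$1 " " $2]++ }
                     END { for (pair in count) print pair, count[pair] }' "$sample" | sort)
echo "$all_pairs"

# Single pair: hand the shell values to awk with -v instead of writing
# $i inside the single-quoted awk program.
q=A; s=B
one_pair=$(awk -F, -v q="$q" -v s="$s" \
    '$1 == q && $2 == s { n++ } END { print q, s, n + 0 }' "$sample")
echo "$one_pair"

rm -f "$sample"
```

The one-pass version is probably the better fit for hundreds of genomes, since it reads the file once rather than once per combination.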