I have a tab-delimited file that looks like this:
input_sequence | match_sequence | score | receptor_group | epitope | antigen | organism |
---|---|---|---|---|---|---|
ASRPPGGVNEQF | ASRPPGGVNEQF | 1.00 | 25735 | EPLPQGQLTAY | Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
ASSYSGGYEQY | ASSYSGGYEQY | 1.00 | 33843 | KTAYSHLSTSK | polymerase Hepatitis B virus (hepatitis B virus (HBV)) | hep b |
ASSYSGGYEQY | ASSYSGGYEQY | 1.00 | 131430 | KLSYGIATV | orf1ab polyprotein [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
ASSYSGGYEQY | ASSFSGGYEQY | 0.97 | 82603 | FTISVTTEIL | surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
ASSYSGGYEQY | ASSYAGGYEQY | 0.98 | 133155 | FVCNLLLLFVTVYSHLLLV | ORF3a protein [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
ASSLFGSTDTQY | ASSLFGSTDTQY | 1.00 | 92508 | FTISVTTEIL | surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
I want to keep only those 'input_sequence' values whose matches all have 'organism' = SARS-CoV2 and nothing else. In this example I would keep only lines 2 and 7 and discard lines 3, 4, 5 and 6, because that 'input_sequence' also has a hit with Hepatitis B virus.
In total I have over 20,000 rows in my file.
Required result:
input_sequence | match_sequence | score | receptor_group | epitope | antigen | organism |
---|---|---|---|---|---|---|
ASRPPGGVNEQF | ASRPPGGVNEQF | 1.00 | 25735 | EPLPQGQLTAY | Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
ASSLFGSTDTQY | ASSLFGSTDTQY | 1.00 | 92508 | FTISVTTEIL | surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] | SARS-CoV2 |
Is there a way to quickly do this using awk or bash (without writing a long script)? Any tips are welcome.
I thought of using awk to count the occurrences of each value in column 1 and the occurrences of SARS-CoV2 in column 7, and then keep only those where the two counts match, but I don't know how to do this. I only got this far (counting the number of occurrences in column one):
awk '{for(i=1;i<=NF;i++)if($i ~ /^/)x++;print x;x=0}' file
Thanks!
CodePudding user response:
You may consider this awk command, which reads the same file twice and joins it on the 1st column:
awk -F'\t' 'NR==FNR {$NF != "SARS-CoV2" && bad[$1]; next}
FNR == 1 || !($1 in bad)' file{,} | column -s $'\t' -t
input_sequence match_sequence score receptor_group epitope antigen organism
ASRPPGGVNEQF ASRPPGGVNEQF 1.00 25735 EPLPQGQLTAY Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSLFGSTDTQY ASSLFGSTDTQY 1.00 92508 FTISVTTEIL surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
PS: column -s $'\t' -t has been used for tabular display only. You can remove it.
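To see why the same file is passed twice, here is a minimal sketch of the NR==FNR idiom on a made-up two-line file (the file contents and the variable names tmp and passes are assumptions for illustration only). NR counts records across all inputs while FNR restarts for each file, so NR==FNR is true only during the first read.

```shell
# Hypothetical two-line sample file, used only for this illustration.
tmp=$(mktemp)
printf 'a\tx\nb\ty\n' > "$tmp"

# NR keeps counting across files, FNR restarts at 1 for each file,
# so NR==FNR holds only while awk reads the first file argument.
passes=$(awk -F'\t' '{ pass = (NR==FNR) ? "pass1" : "pass2"; print pass, $1 }' "$tmp" "$tmp")
printf '%s\n' "$passes"
rm -f "$tmp"
```

In bash, file{,} is brace expansion and is simply shorthand for writing file file.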
CodePudding user response:
You need to do two passes over the input.
The first pass makes an array whose keys are the input sequences that have an organism other than SARS-CoV2. The second pass checks if the current input sequence is in that array. If not, it prints the line.
awk -F'\t' 'NR==FNR {if ($7 != "SARS-CoV2") {a[$1]=1}; next}
FNR==1 || !a[$1]' file file
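The two passes can be tried out on a small made-up file. This is only a sketch: the three-column sample (sequence, score, organism) and the names sample, bad and kept are assumptions, not the real data. The array is keyed on the first column, and the header row is skipped in the first pass and always printed in the second.

```shell
# Made-up sample: sequence, score, organism (tab separated). Not the real data.
sample=$(mktemp)
printf 'input_sequence\tscore\torganism\n' > "$sample"
printf 'AAA\t1.00\tSARS-CoV2\n' >> "$sample"
printf 'BBB\t1.00\thep b\n' >> "$sample"
printf 'BBB\t0.97\tSARS-CoV2\n' >> "$sample"
printf 'CCC\t1.00\tSARS-CoV2\n' >> "$sample"

# Pass 1 (NR==FNR): remember every sequence seen with another organism.
# Pass 2: always print the header, then rows whose sequence was never flagged.
kept=$(awk -F'\t' '
  NR==FNR { if (FNR > 1 && $NF != "SARS-CoV2") bad[$1]; next }
  FNR == 1 || !($1 in bad)
' "$sample" "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

Here both BBB rows are dropped, including the one whose organism is SARS-CoV2, because BBB also has a hep b hit.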
CodePudding user response:
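awk -F'\t' '
NR==1 { print; next }
$NF != "SARS-CoV2" { bad[$1] }
# append, so that every row of a sequence is kept, not only the last one
$NF == "SARS-CoV2" { good[$1] = ($1 in good ? good[$1] ORS : "") $0 }
END {
for(v in bad) delete good[v]
for(v in good) print good[v]
}
' file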
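This reads the file only once, so it could be a solution. Note that awk does not guarantee the order of for (v in good), so the rows may come out in a different order than in the input.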
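The counting idea from the question can also be written down directly. A minimal sketch, assuming a made-up three-column sample (the file contents and the names data, total, sars and result are hypothetical): the first pass counts all rows per sequence and the SARS-CoV2 rows per sequence, and the second pass keeps a row only when the two counts are equal. It reads the file twice, but keeps the rows in input order.

```shell
# Made-up sample: sequence, score, organism (tab separated). Not the real data.
data=$(mktemp)
printf 'input_sequence\tscore\torganism\n' > "$data"
printf 'AAA\t1.00\tSARS-CoV2\n' >> "$data"
printf 'BBB\t1.00\thep b\n' >> "$data"
printf 'BBB\t0.97\tSARS-CoV2\n' >> "$data"
printf 'CCC\t1.00\tSARS-CoV2\n' >> "$data"
printf 'CCC\t0.98\tSARS-CoV2\n' >> "$data"

# Pass 1: total[] counts rows per sequence, sars[] counts its SARS-CoV2 rows.
# Pass 2: print the header, then rows whose sequence has only SARS-CoV2 hits.
result=$(awk -F'\t' '
  NR==FNR { if (FNR > 1) { total[$1]++; if ($NF == "SARS-CoV2") sars[$1]++ }; next }
  FNR == 1 || total[$1] == sars[$1]
' "$data" "$data")
printf '%s\n' "$result"
rm -f "$data"
```

With this sample, the header, the AAA row and both CCC rows are printed, while both BBB rows are dropped.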