text processing question: remove rows where one column does not contain a particular value-CodePudding

I have a tab delimited file that looks like this:

input_sequence	match_sequence	score	receptor_group	epitope	antigen	organism
ASRPPGGVNEQF	ASRPPGGVNEQF	1.00	25735	EPLPQGQLTAY	Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2
ASSYSGGYEQY	ASSYSGGYEQY	1.00	33843	KTAYSHLSTSK	polymerase Hepatitis B virus (hepatitis B virus (HBV))	hep b
ASSYSGGYEQY	ASSYSGGYEQY	1.00	131430	KLSYGIATV	orf1ab polyprotein [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2
ASSYSGGYEQY	ASSFSGGYEQY	0.97	82603	FTISVTTEIL	surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2
ASSYSGGYEQY	ASSYAGGYEQY	0.98	133155	FVCNLLLLFVTVYSHLLLV	ORF3a protein [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2
ASSLFGSTDTQY	ASSLFGSTDTQY	1.00	92508	FTISVTTEIL	surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2

I want to keep 'input_sequence' that only match with 'organism' = SARS-CoV2 and nothing else. So in this example I would keep only line 2 and line 7 and discard lines 3,4,5,6 because here this 'input_sequence' has also a hit with Hepatitis B virus.

In total I have over 20.000 rows in my file.

results required:

input_sequence	match_sequence	score	receptor_group	epitope	antigen	organism
ASRPPGGVNEQF	ASRPPGGVNEQF	1.00	25735	EPLPQGQLTAY	Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2
ASSLFGSTDTQY	ASSLFGSTDTQY	1.00	92508	FTISVTTEIL	surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]	SARS-CoV2

Is there a way to quickly do this using awk or bash (without writing a long script)? Any tips are welcome.

I thought to use awk to count the occurences of each value in column 1 and occurences of SARS-COV2 in column 7, and then only keep those that match... but I don't know how to do this. I only got this far (counting the number of occurrences in column one):

awk '{for(i=1;i<=NF;i  )if($i ~ /^/)x  ;print x;x=0}' file

Thanks!

CodePudding user response：

You may consider this awk that joins same file on 1st column:

awk -F'\t' 'NR==FNR {$NF != "SARS-CoV2" && bad[$1]; next}
FNR == 1 || !($1 in bad)' file{,} | column -s $'\t' -t

input_sequence  match_sequence  score  receptor_group  epitope      antigen                                                                          organism
ASRPPGGVNEQF    ASRPPGGVNEQF    1.00   25735           EPLPQGQLTAY  Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2
ASSLFGSTDTQY    ASSLFGSTDTQY    1.00   92508           FTISVTTEIL   surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]           SARS-CoV2

PS: column -s $'\t' -t has been used for tabular display only. You can remove it.

CodePudding user response：

You need to do two passes over the input.

The first pass makes an array whose keys are the input sequences that have an organism other than SARS-CoV2. The second pass checks if the current input sequence is in that array. If not, it prints the line.

awk -F'\t' 'NR==FNR {if ($7 != "SARS-CoV2") {a[$0]=1}; next}
            !a[$0]' file file

CodePudding user response：

awk '
    NR==1
    $NF != "SARS-CoV2" { bad[$1] }
    $NF == "SARS-CoV2" { good[$1]=$0 }
    END {
        for(v in bad) delete good[v]
        for(v in good) print good[v]
    }
' file

Passing the file once, this could potentially be a solution.