Home > OS >  text processing question: remove rows where one column does not contain a particular value
text processing question: remove rows where one column does not contain a particular value

Time:02-11

I have a tab delimited file that looks like this:

input_sequence match_sequence score receptor_group epitope antigen organism
ASRPPGGVNEQF ASRPPGGVNEQF 1.00 25735 EPLPQGQLTAY Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSYSGGYEQY ASSYSGGYEQY 1.00 33843 KTAYSHLSTSK polymerase Hepatitis B virus (hepatitis B virus (HBV)) hep b
ASSYSGGYEQY ASSYSGGYEQY 1.00 131430 KLSYGIATV orf1ab polyprotein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSYSGGYEQY ASSFSGGYEQY 0.97 82603 FTISVTTEIL surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSYSGGYEQY ASSYAGGYEQY 0.98 133155 FVCNLLLLFVTVYSHLLLV ORF3a protein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSLFGSTDTQY ASSLFGSTDTQY 1.00 92508 FTISVTTEIL surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2

I want to keep 'input_sequence' that only match with 'organism' = SARS-CoV2 and nothing else. So in this example I would keep only line 2 and line 7 and discard lines 3,4,5,6 because here this 'input_sequence' has also a hit with Hepatitis B virus.

In total I have over 20.000 rows in my file.

results required:

input_sequence match_sequence score receptor_group epitope antigen organism
ASRPPGGVNEQF ASRPPGGVNEQF 1.00 25735 EPLPQGQLTAY Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSLFGSTDTQY ASSLFGSTDTQY 1.00 92508 FTISVTTEIL surface glycoprotein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2

Is there a way to quickly do this using awk or bash (without writing a long script)? Any tips are welcome.

I thought to use awk to count the occurences of each value in column 1 and occurences of SARS-COV2 in column 7, and then only keep those that match... but I don't know how to do this. I only got this far (counting the number of occurrences in column one):

awk '{for(i=1;i<=NF;i  )if($i ~ /^/)x  ;print x;x=0}' file

Thanks!

CodePudding user response:

You may consider this awk that joins same file on 1st column:

awk -F'\t' 'NR==FNR {$NF != "SARS-CoV2" && bad[$1]; next}
FNR == 1 || !($1 in bad)' file{,} | column -s $'\t' -t

input_sequence  match_sequence  score  receptor_group  epitope      antigen                                                                          organism
ASRPPGGVNEQF    ASRPPGGVNEQF    1.00   25735           EPLPQGQLTAY  Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2
ASSLFGSTDTQY    ASSLFGSTDTQY    1.00   92508           FTISVTTEIL   surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]           SARS-CoV2

PS: column -s $'\t' -t has been used for tabular display only. You can remove it.

CodePudding user response:

You need to do two passes over the input.

The first pass makes an array whose keys are the input sequences that have an organism other than SARS-CoV2. The second pass checks if the current input sequence is in that array. If not, it prints the line.

awk -F'\t' 'NR==FNR {if ($7 != "SARS-CoV2") {a[$0]=1}; next}
            !a[$0]' file file

CodePudding user response:

awk '
    NR==1
    $NF != "SARS-CoV2" { bad[$1] }
    $NF == "SARS-CoV2" { good[$1]=$0 }
    END {
        for(v in bad) delete good[v]
        for(v in good) print good[v]
    }
' file

Passing the file once, this could potentially be a solution.

  • Related