How to select a row by name and also the previous row in bash or python?

Imagine that we have this data:

##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

The desired output is every row that contains the word 'Transmembrane', together with the row immediately above it:

##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

I am trying with grep, but I am a little bit stuck.

Thanks!

CodePudding user response：

You could use Python for this task in the following way. Let the content of file.txt be

##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

then create a file gettransmembrane.py containing

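import fileinput

prevline = ""  # nothing precedes the first line, so start with an empty string
for line in fileinput.input():  # reads the files named on the command line, or stdin
    if "Transmembrane" in line:
        # lines keep their trailing newline, so suppress print's own
        print(prevline, end="")
        print(line, end="")
    prevline = line  # remember this line for the next iteration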

then run

python gettransmembrane.py file.txt

output

##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

Explanation: fileinput is a module from the Python standard library (1). For each line that contains the substring Transmembrane, I print the previous line and then the line itself. Note that prevline = line is done after printing, so prevline still holds the preceding line when the check runs. I pass an empty string as end because the lines already have newlines at their ends.

(1) If you only need to process one file whose name you know in advance, you might elect to read it with open instead. Using fileinput lets you pass more than one file (akin to the cat command) or read from stdin, so if the data above is the output of another command, you do not need a temporary file; you can pipe the output of that command into python gettransmembrane.py
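
The same idea can also be written as a small reusable function. This is only a sketch, not part of the script above: the name select_with_previous and the sample list are made up for illustration. Unlike the script, it remembers whether the previous line was already emitted, so that two consecutive matching lines do not cause the first one to be printed twice.

```python
def select_with_previous(lines, needle="Transmembrane"):
    """Yield each line containing `needle`, preceded by the line just before it."""
    prev = None           # previous line, or None at the start of the input
    prev_emitted = False  # True if the previous line has already been yielded
    for line in lines:
        if needle in line:
            if prev is not None and not prev_emitted:
                yield prev
            yield line
            prev_emitted = True
        else:
            prev_emitted = False
        prev = line


# Hypothetical sample data, shortened from the question.
sample = [
    "##sequence-region P51451 1 505\n",
    "##sequence-region P22223 1 829\n",
    "P22223\tUniProtKB\tTransmembrane\t655\t677\n",
    "##sequence-region Q01518 1 475\n",
]
result = list(select_with_previous(sample))
print("".join(result), end="")
```

Because the function accepts any iterable of lines, it should also work with an open file object or with fileinput.input().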

CodePudding user response：

If you've got GNU grep (the standard grep on Linux) and your data are in the file data.txt you can use:

grep -w Transmembrane --before-context=1 --no-group-separator data.txt

The -w option will cause the match to apply to only whole words in the input. So, for instance, Transmembrane123 won't be matched. That might not be what you want.
--before-context=1 causes grep to print one line in the input before every matched line.
--no-group-separator causes grep to print no separator between groups of matched line and previous line. Normally it prints a separator line containing --.
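
If you would rather stay in Python, the whole-word behaviour of -w can be approximated with a regular expression. This is a rough sketch and not an exact reimplementation of grep: it assumes that the \b word boundary from the re module is close enough to grep's notion of a word for this data, and the function name is_whole_word_match is made up for illustration.

```python
import re

# \b marks a word boundary, which approximates grep's -w whole-word matching.
WORD = re.compile(r"\bTransmembrane\b")


def is_whole_word_match(line):
    """Return True if `line` contains Transmembrane as a whole word."""
    return WORD.search(line) is not None


print(is_whole_word_match("P22223\tUniProtKB\tTransmembrane\t655"))     # True
print(is_whole_word_match("P22223\tUniProtKB\tTransmembrane123\t655"))  # False
```

A check like this could replace the plain substring test in a script that prints each matching line and the one before it.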