Imagine that we have this data:
##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
The desired output contains only the rows with the word 'Transmembrane', each preceded by the row directly above it:
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
I have been trying with grep, but I am a little stuck.
Thanks!
CodePudding user response:
You might use Python for this task in the following way. Let the content of file.txt be
##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
then create the file gettransmembrane.py containing
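import fileinput

prevline = ""  # stays empty if the very first line already matches
for line in fileinput.input():
    if "Transmembrane" in line:
        # print the preceding line, then the matching line itself
        print(prevline, end="")
        print(line, end="")
    prevline = line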
then run
python gettransmembrane.py file.txt
which gives the output
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
Explanation: fileinput is a module from the Python standard library (1). For each line that contains the substring Transmembrane, I print the previous line and then the line itself. Note that prevline = line is done after printing. I pass an empty string as end because the lines already have newlines at their ends.
(1) If you only need to process one file whose name you know in advance, you might elect to use plain file reading with open. Using fileinput lets you process more than one file (akin to the cat command) or read from stdin, so if the data above is the output of another command you do not have to make a temporary file; you can pipe the output of that command into python gettransmembrane.py.
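As a minimal sketch of the plain open variant mentioned in (1), the same previous-line logic can be wrapped in a generator that accepts any iterable of lines. The function name pairs_with_previous and its keyword parameter are made up for this illustration and are not part of the answer above; file.txt is the file name assumed earlier.

```python
def pairs_with_previous(lines, keyword="Transmembrane"):
    """Yield each line containing `keyword`, preceded by the line before it.

    `lines` can be any iterable of strings, such as an open file object.
    The name and signature are hypothetical, chosen for this sketch.
    """
    prevline = None  # there is no previous line before the first one
    for line in lines:
        if keyword in line:
            if prevline is not None:
                yield prevline
            yield line
        prevline = line


if __name__ == "__main__":
    import os

    path = "file.txt"  # assumed file name, as in the answer above
    if os.path.exists(path):
        with open(path) as handle:
            for out in pairs_with_previous(handle):
                print(out, end="")  # lines keep their own newlines
```

Because the function only needs an iterable, it should work equally well with sys.stdin or a list of strings, which also makes it easy to try out without a file on disk.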
CodePudding user response:
If you've got GNU grep (the standard grep on Linux) and your data are in the file data.txt, you can use:
grep -w Transmembrane --before-context=1 --no-group-separator data.txt
- The -w option will cause the match to apply to only whole words in the input. So, for instance, Transmembrane123 won't be matched. That might not be what you want.
- --before-context=1 causes grep to print one line in the input before every matched line.
- --no-group-separator causes grep to print no separator between groups of matched line and previous line. Normally it prints a separator line containing --.
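For readers without GNU grep, the behaviour of these three options can be approximated in Python. This is only a rough sketch, not a faithful reimplementation of grep: it assumes that the regex word boundary \b is close enough to -w for a pattern made purely of word characters, and the function name grep_w_before_one is hypothetical.

```python
import re

# \b marks a word boundary, which approximates grep's -w here because
# the pattern itself consists only of word characters.
WORD = re.compile(r"\bTransmembrane\b")


def grep_w_before_one(lines):
    """Return matching lines, each preceded by the line before it.

    Roughly mimics `grep -w --before-context=1 --no-group-separator`;
    the function name is made up for this sketch.
    """
    result = []
    prev = None          # previous line, if any
    prev_emitted = True  # True when `prev` is already in `result`
    for line in lines:
        if WORD.search(line):
            # Add the context line only if it was not output already,
            # so consecutive matches are not duplicated.
            if prev is not None and not prev_emitted:
                result.append(prev)
            result.append(line)
            prev_emitted = True
        else:
            prev_emitted = False
        prev = line
    return result
```

No separator is ever added between groups, which corresponds to --no-group-separator; a line such as one containing Transmembrane123 is skipped, as with -w.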