I have a fasta file as follows:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>hik \PName=EERT abs 1 \GName=EERT \Type=2 \Processed=(1|181:mature protein)
MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPDQWIVPGGGMEPEEE
>dmd \PName=YYHY abs 1 \GName=YYHY \Type=0 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
I want to extract the FASTA records whose header contains Type=1, so that my output looks as follows:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
I tried grep 'Type=1' file.fasta, but it returned only the header lines, without the sequences:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
How do I get my desired output?
CodePudding user response:
If awk is an option:
awk '
BEGIN { RS="" } # redefine input record separator as empty string => treat consecutive non-blank lines as single record
/Type=1 / { print $0 ORS } # if record contains string "Type=1 " then print record plus default output record separator; ORS provides the blank line
' file.fasta
This generates:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
CodePudding user response:
You can try this script. It reads the file and parses it with itertools.groupby and re:
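import re
from itertools import groupby

# capture the digits that follow "Type=" in a header line
pat = re.compile(r">.*?Type=(\d+)")
out = []
with open("your_file.txt", "r") as f_in:
    # group consecutive lines by whether they are blank; non-blank runs are records
    for is_blank, g in groupby(f_in, lambda s: s.strip() == ""):
        if not is_blank:
            g = list(g)
            m = pat.match(g[0])
            # keep the record only if its header has Type=1
            if m and m.group(1) == "1":
                out.append("".join(g).strip())
print("\n\n".join(out))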
Prints:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
CodePudding user response:
A sed one-liner would do the job, assuming blocks are separated by empty lines, as in the sample input file:
sed '/\\Type=1 /,/^$/!d' file
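The sed range above, like the awk and groupby approaches, relies on blank lines between records. If the file has none, one possible alternative is to treat every line starting with ">" as the start of a new record. The following is only a minimal sketch under that assumption; filter_fasta and its type_value parameter are hypothetical names chosen for illustration, not part of any library.

```python
import re


def filter_fasta(lines, type_value="1"):
    """Yield the lines of every record whose header contains \\Type=<type_value>."""
    # Assumption: the tag appears in the header as a literal backslash plus "Type=<value>".
    # The negative lookahead keeps Type=1 from also matching Type=10, Type=11, ...
    pat = re.compile(r"\\Type=" + re.escape(type_value) + r"(?!\d)")
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A header line decides whether the following sequence lines are kept.
            keep = bool(pat.search(line))
        if keep and line.strip():
            # Skip blank lines so the sketch works with or without separators.
            yield line.rstrip("\n")


# Small demonstration on headers and sequence lines taken from the question.
sample = r""">abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
R
>hik \PName=EERT abs 1 \GName=EERT \Type=2 \Processed=(1|181:mature protein)
MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPDQWIVPGGGMEPEEE
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
G
""".splitlines()

print("\n".join(filter_fasta(sample)))
```

To run it on a real file, pass an open file handle instead of the sample list, for example filter_fasta(open("file.fasta")). Because the function is a generator, it processes the file line by line and does not need to hold all records in memory.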