How to extract specific FASTA records (header and sequence) from a given file?

Time:07-30

I have a FASTA file as follows:

>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R

>hik \PName=EERT abs 1 \GName=EERT \Type=2 \Processed=(1|181:mature protein)
MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPDQWIVPGGGMEPEEE

>dmd \PName=YYHY abs 1 \GName=YYHY \Type=0 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS

>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G

I want to extract the FASTA records whose header contains Type=1, so that my output looks as follows:

>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R

>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G

I tried grep 'Type=1' file.fasta, but it returned only the header lines, without the sequences:

>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)

How do I get my desired output?

CodePudding user response:

If awk is an option:

awk '
BEGIN     { RS="" }            # redefine input record separator as empty string => treat consecutive non-blank lines as single record
/Type=1 / { print $0 ORS }     # if record contains string "Type=1 " then print record plus default output record separator; ORS provides the blank line
' file.fasta

This generates:

>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R

>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G

CodePudding user response:

You can try this script. It reads the file and parses it with itertools.groupby and re:

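import re
from itertools import groupby

# Capture the digits that follow "Type=" in a header line.
pat = re.compile(r">.*?Type=(\d+)")

out = []
with open("your_file.txt", "r") as f_in:
    # Group consecutive lines by blankness; v is True for runs of blank lines.
    for v, g in groupby(f_in, lambda s: s.strip() == ""):
        if not v:
            g = list(g)
            m = pat.match(g[0])
            # Keep the record only if its header has Type=1.
            if m and m.group(1) == "1":
                out.append("".join(g).strip())

print("\n\n".join(out))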

Prints:

>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R

>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
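
If the records are not reliably separated by blank lines, a header-driven variant may be more robust. The following is only a sketch under stated assumptions, not part of the answer above: the helper name `filter_fasta_by_type` is hypothetical, every record is assumed to start with a `>` line, and the header is assumed to carry a `\Type=<digits>` field as in the sample.

```python
import re

# Matches the \Type=<digits> field of a header line. The leading backslash
# is an assumption taken from the sample headers in the question.
TYPE_RE = re.compile(r"\\Type=(\d+)\b")


def filter_fasta_by_type(lines, wanted="1"):
    """Return records (header plus sequence lines) whose Type equals `wanted`.

    Hypothetical helper: a record starts at a '>' line and ends at the next
    '>' line or at end of input, so blank separator lines are not required.
    """
    records = []
    current = None  # lines of the record being kept, or None while skipping
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record.
            if current is not None:
                records.append("\n".join(current))
            m = TYPE_RE.search(line)
            current = [line] if m and m.group(1) == wanted else None
        elif current is not None and line.strip():
            # Sequence line of a kept record; blank lines are dropped.
            current.append(line)
    if current is not None:
        records.append("\n".join(current))
    return records
```

With an open file handle `f_in`, something like `print("\n\n".join(filter_fasta_by_type(f_in)))` should produce blank-line-separated output in the same shape as above. Because the whole digit run is compared, a header with Type=10 is not mistaken for Type=1.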

CodePudding user response:

A sed one-liner would do the job, assuming blocks are separated by empty lines, as in the sample input file:

sed '/\\Type=1 /,/^$/!d' file
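
To make the range logic of that one-liner easier to follow, here is a rough Python sketch of the same idea: start keeping lines at a header that contains `\Type=1 ` and stop after the next empty line. The function name `select_ranges` and its `start_marker` parameter are hypothetical, and the sketch assumes, like the sed command, that blocks are separated by empty lines.

```python
def select_ranges(lines, start_marker="\\Type=1 "):
    """Keep each run of lines from one containing start_marker through the
    next empty line (or end of input). Hypothetical helper for illustration."""
    kept = []
    in_range = False
    for line in lines:
        text = line.rstrip("\n")
        # A range opens on a line that contains the marker.
        if not in_range and start_marker in text:
            in_range = True
        if in_range:
            kept.append(text)
            # The empty separator line is kept and closes the range.
            if text == "":
                in_range = False
    return kept
```

Joining the result with newlines, for example `print("\n".join(select_ranges(open("file"))))`, should give the matching blocks including their separating empty lines. The trailing space in the marker is what keeps a header such as `\Type=10 ` from matching.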