Home > Enterprise >  Cutting out dictionary file defined regions of a FASTA gene sequence
Cutting out dictionary file defined regions of a FASTA gene sequence

Time:10-01

Very new to Python and coding in general so feel free to laugh. I want to use a txt file (dict) in the following format with genes in the first column and the region of the sequence (start position end position)

ORFB    21563 25384
ORF3a   25393 26220
ORF2a   26245 26472
ORF10   29558 29674
S   21563 25384
E   26245 26472

to read a FASTA DNA file from Genbank (GENE.fasta.txt) so that the output would be the gene name and then the sequence between start and stop for each gene.

I tried the following...no luck. I would really like to learn rather than just be given the code. Any help is greatly appreciated.

with open('dict.txt') as f:
    ranges = {ID: (int(start), int(stop)) for ID, start, stop in map(lambda s: s.strip().split(), f)}

from Bio import SeqIO
with open ('GENE.fasta.txt') as handle:
    out = [r[slice(*ranges[r.id])] for r in SeqIO.parse(handle, 'fasta')]

with open('output.fasta', 'w') as handle:
    SeqIO.write(out, handle, 'fasta')

CodePudding user response:

You are applying the map to the file object. You have to do something like this:

with open('dict.txt') as f:
    ranges = {
        ID: (int(start), int(stop))
        for ID, start, stop
        in map(lambda line: line.strip().split(), f.readlines())
    }

The readlines method returns list of lines in your file. And maybe is more readable to use list comprehension: [line.strip().split() for line in f.readlines()]

I hope I helped

CodePudding user response:

apparently MN908947.3 in not an ID that you built with your first step

so when you try and do ranges[r.id] it does not exist and throws an error

you could maybe do something more like

with open ('GENE.fasta.txt') as handle:
    out = [r[slice(*ranges[r.id])] for r in SeqIO.parse(handle, 'fasta') if r.id in ranges]

this will only get ranges that you identified in your first step ... it is possible you are not building your ids correctly, but there is not enough info in your question to assert anything other than that is also a possibility of why it might be failing (ie if you think MN908947.3 is in your first step)

  • Related