Creating dictionary from a '.fasta' file containing several genes from an organism

I have a '.txt' file that lists genes together with their sequences. I need to create a dictionary in which the keys are the gene names and the values are the sequences.

I want the dictionary to look like this: dict = {'sequence1': 'AATTGGCC', 'sequence2': 'AAGGCCTT', ...}

So this is what I tried, but I ran into some problems:

dictionary = {}

accesion_number = ""
sequentie = ""

with open("6EP.fasta", "r") as proteoom:
    for line in proteoom:
        if line.startswith(">"):
            line.strip()
            dictionary[accesion_number] = sequentie
            sequentie = ""
        else:
            sequentie = sequentie + line.rstrip().strip("\n").strip("\r")
    dictionary[accesion_number] = sequentie

Does anyone know what went wrong here, and how I can fix it? Thanks in advance!

CodePudding user response：

I can think of two ways to do this:

High memory usage

If the file is not too large, you can use readlines() and then use the indexes like so:

IDs = []
sequences = []

with open('Proteome.fasta', 'r') as f:
    raw_data = f.readlines()

for i, l in enumerate(raw_data):
    if l[0] == '>':
        IDs.append(l.strip())                      # Header line, newline removed
        sequences.append(raw_data[i + 1].strip())  # Sequence on the next line
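
The snippet above keeps only the single line that follows each header, so it fits files where every sequence sits on one line. If sequences wrap over several lines, one possible variation (a sketch, not part of the original approach) is to record the header positions and join everything between consecutive headers. The `raw_data` list below is a made-up stand-in for `f.readlines()`, and using the text after '>' as the dictionary key is an assumption.

```python
# Hypothetical stand-in for f.readlines(); real data would come from the file.
raw_data = [
    ">sequence1\n",
    "AATT\n",
    "GGCC\n",
    ">sequence2\n",
    "AAGGCCTT\n",
]

# Positions of the header lines (those starting with '>').
header_idx = [i for i, l in enumerate(raw_data) if l.startswith(">")]

records = {}
# Pair each header position with the next one (or with the end of the list).
for start, end in zip(header_idx, header_idx[1:] + [len(raw_data)]):
    name = raw_data[start][1:].strip()  # assumption: key is the text after '>'
    records[name] = "".join(l.strip() for l in raw_data[start + 1:end])

print(records)
```

This still holds the whole file in memory, so it shares the trade-off of the readlines() approach.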

Low memory usage

If you don't want to load the contents of the file into memory, you can read the file twice instead, saving the index of every ID line plus one on the first pass:

First pass: get the '>' lines and save each ID line's index plus one, which is the index of its sequence line
Second pass: check whether the current line number is the next saved index and, if so, append the line to your sequences

Here I'm taking advantage of the fact that the indexes list is sorted by construction, so only its first element needs to be checked.

IDs = []
indexes = []
sequences = []

with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if l[0] == '>':                    # Only header lines start with '>'
            IDs.append(l.strip())          # Get your IDs
            indexes.append(i + 1)          # Get the index of the ID + 1

with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if indexes and i == indexes[0]:    # Check whether line matches with the next index
            sequences.append(l.strip())    # Get your sequence
            indexes.pop(0)                 # Remove the first element of the indexes
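
If a second pass over the file is undesirable, a single pass can also stay low on memory. The following is a minimal sketch rather than the two-pass method described above: `read_fasta` is a hypothetical helper name, and it assumes that header lines start with '>' and that a sequence may span several lines. `io.StringIO` stands in for an opened file so the example runs on its own.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                        # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)  # emit the previous record
            name, parts = line[1:], []      # start a new record
        elif name is not None:
            parts.append(line)              # collect one sequence line
    if name is not None:
        yield name, "".join(parts)          # emit the last record

# Hypothetical in-memory sample; with a real file, pass the open file object.
sample = io.StringIO(">sequence1\nAATT\nGGCC\n>sequence2\nAAGGCCTT\n")
genes = dict(read_fasta(sample))
print(genes)
```

Because the generator holds only one record at a time, memory use depends on the longest sequence rather than on the file size, until the results are collected into the dictionary.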

I hope this helps! ;)

CodePudding user response：

Code

ids = []
seq = []
char = ['_', ':', '*', '#']                #invalid in sequence
seqs = ''

with open('fasta.txt', 'r') as f:          #open sample fasta
  for line in f:
    if line.startswith('>'):
      ids.append(line.strip('\n'))
      if seqs != '':                       #if there's previous seq
        seq.append(seqs)                   #append the seq
        seqs = ''                          #then start a new seq
    elif line.strip('\n') not in char:     #skip lines that are only an invalid char
      seqs = seqs + line.strip('\n')       #build seq with each line until '>'
  seq.append(seqs)                         #append any remaining seq

print(ids)
print(seq)

Result

['>SeqABCD [organism=Mus musculus]', '>SeqABCDE [organism=Plasmodium falciparum]']
['ACGTCAGTCACGTACGTCAGTTCAGTC...', 'GGTACTGCAAAGTTCTTCCGCCTGATTA...']
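
Since the question asks for a dictionary, the two lists can be paired up with zip(). A small sketch, assuming the key should be the first word of the header without the leading '>' (that choice is an assumption, not something the question specifies). The lists here are shortened, hypothetical stand-ins for the `ids` and `seq` lists produced by the code.

```python
# Shortened, hypothetical stand-ins for the ids and seq lists.
ids = [">SeqABCD [organism=Mus musculus]",
       ">SeqABCDE [organism=Plasmodium falciparum]"]
seq = ["ACGTCAGTC", "GGTACTGCA"]

# Assumption: use the first whitespace-separated token after '>' as the key.
genes = {header[1:].split()[0]: s for header, s in zip(ids, seq)}

print(genes)
```

If two headers share the same first word, the later entry overwrites the earlier one, so the full header may be the safer key in that case.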

Sample File

>SeqABCD [organism=Mus musculus]
ACGTCAGTCACGTACGTCAGTTCAGTCARYSTYSATCASMBMBDH
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAA
AAATATCATTTGAGGTCAATACAAATCCTATTTCTATCGTTTTT
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAAT
>SeqABCDE [organism=Plasmodium falciparum]
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTT
TTGTTTTGCTTCTTTGAAGTAGTTTCTCTTTGCAAAATTCCTCTT
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCGGTACTGCAA
AGTCAATTTTATATAATTTAATCAAATAAATAAGTTTATGGTTAA