I have a '.txt' file that lists genes together with their sequences. I need to create a dictionary in which the keys are the gene names and the values are the sequences.
I want the dictionary to look like this: dict = {'sequence1': 'AATTGGCC', 'sequence2': 'AAGGCCTT', ...}
This is what I tried, but I ran into some problems:
dictionary = {}
accesion_number = ""
sequentie = ""
with open("6EP.fasta", "r") as proteoom:
    for line in proteoom:
        if line.startswith(">"):
            line.strip()
            dictionary[accesion_number] = sequentie
            sequentie = ""
        else:
            sequentie = sequentie + line.rstrip().strip("\n").strip("\r")
    dictionary[accesion_number] = sequentie
Does anyone know what went wrong here, and how I can fix it? Thanks in advance!
CodePudding user response:
I can think of two ways to do this:
High memory usage
If the file is not too large, you can use readlines() and then use the indexes like so:
IDs = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    raw_data = f.readlines()
    for i, l in enumerate(raw_data):
        if l[0] == '>':
            IDs.append(l.strip())
            # Assumes each sequence sits on the single line right after its ID
            sequences.append(raw_data[i + 1].strip())
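Since the question asks for a dictionary, the two lists can be combined afterwards. This is only a minimal sketch: the sample values below are stand-ins taken from the question, and gene_dict is a name I made up for illustration.

```python
# Stand-in values for the lists built above (taken from the question's example)
IDs = ['>sequence1', '>sequence2']
sequences = ['AATTGGCC', 'AAGGCCTT']

# Drop the leading '>' from each ID and pair it with its sequence
gene_dict = {name.lstrip('>'): seq for name, seq in zip(IDs, sequences)}
print(gene_dict)
```

Note that zip() stops at the shorter list, so a trailing ID without a sequence would be silently dropped.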
Low memory usage
Now, if you don't want to load the contents of the file into memory, then I think you can read the file twice by saving the indexes of every ID line plus one, like so:
- Get the '>' lines and their indexes, which will be the ID index plus one
- Check whether the line number is in the indexes list and, if so, append the content to your variable
Here I'm taking advantage of the fact that the indexes list is sorted by construction, so only its first element needs to be checked.
IDs = []
indexes = []
sequences = []
# First pass: record each ID and the line number of the sequence after it
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if l[0] == '>':
            IDs.append(l.strip())  # Get your IDs
            indexes.append(i + 1)  # Get the index of the ID + 1
# Second pass: collect the lines whose numbers were recorded
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if indexes and i == indexes[0]:  # Check whether line matches with the index
            sequences.append(l.strip())  # Get your sequence
            indexes.pop(0)  # Remove the first element of the indexes
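If reading the file twice is not wanted, a single pass can probably do the same job by remembering whether the previous line was an ID. This is a different technique from the index list above, sketched under the same assumption that each sequence sits on one line. read_pairs is a hypothetical helper name, and io.StringIO stands in for the real file.

```python
import io

def read_pairs(handle):
    """Yield (ID, sequence) pairs one at a time from an open FASTA handle."""
    current_id = None
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            current_id = line  # remember the ID until its sequence arrives
        elif current_id is not None:
            yield current_id, line  # the line right after an ID is its sequence
            current_id = None

# In-memory stand-in for open('Proteome.fasta', 'r')
sample = io.StringIO(">sequence1\nAATTGGCC\n>sequence2\nAAGGCCTT\n")
pairs = list(read_pairs(sample))
print(pairs)
```

Because read_pairs is a generator, only one line is held in memory at a time; wrapping it in list() here is just for display.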
I hope this helps! ;)
CodePudding user response:
Code
ids = []
seq = []
char = ['_', ':', '*', '#']  # invalid in sequence
seqs = ''
with open('fasta.txt', 'r') as f:  # open sample fasta
    for line in f:
        if line.startswith('>'):
            ids.append(line.strip('\n'))
            if seqs != '':  # if there's previous seq
                seq.append(seqs)  # append the seq
                seqs = ''  # then start a new seq
        elif line.strip('\n') not in char:  # skip lines that are only an invalid character
            seqs = seqs + line.strip('\n')  # build seq with each line until '>'
    seq.append(seqs)  # append any remaining seq
print(ids)
print(seq)
Result
['>SeqABCD [organism=Mus musculus]', '>SeqABCDE [organism=Plasmodium falciparum]']
['ACGTCAGTCACGTACGTCAGTTCAGTC...', 'GGTACTGCAAAGTTCTTCCGCCTGATTA...']
Sample File
>SeqABCD [organism=Mus musculus]
ACGTCAGTCACGTACGTCAGTTCAGTCARYSTYSATCASMBMBDH
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAA
AAATATCATTTGAGGTCAATACAAATCCTATTTCTATCGTTTTT
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAAT
>SeqABCDE [organism=Plasmodium falciparum]
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTT
TTGTTTTGCTTCTTTGAAGTAGTTTCTCTTTGCAAAATTCCTCTT
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCGGTACTGCAA
AGTCAATTTTATATAATTTAATCAAATAAATAAGTTTATGGTTAA
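To get the dictionary the question asks for, the same kind of loop can store each sequence under its header instead of filling two lists. This is a minimal sketch, assuming the gene name is the first word after '>'. parse_fasta is a hypothetical helper name, and the in-memory sample reuses the example values from the question rather than a real file.

```python
import io

def parse_fasta(handle):
    """Return {name: sequence}, joining sequences that span several lines."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            fields = line[1:].split()
            # Assumption: the name is the first word of the header line
            name = fields[0] if fields else ''
            records[name] = []
        elif name is not None:
            records[name].append(line)  # collect every line until the next '>'
    return {key: ''.join(parts) for key, parts in records.items()}

# In-memory stand-in for open('fasta.txt', 'r')
sample = io.StringIO(">sequence1 some description\nAATT\nGGCC\n>sequence2\nAAGGCCTT\n")
genes = parse_fasta(sample)
print(genes)
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, which can matter for long sequences.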