How to only run certain lines of text file through dictionary and keep everything else the same

For my Computational Biology final project, I am required to take a DNA sequence, transcribe it into RNA, and then translate that into a protein sequence. Below is an example (2dna.fasta, the file currently running through my code):

>ENST00000632684.1
GGGACAGGGGGC
>ENST00000434970.2
CCTTCCTAC

Anything that starts with '>' is metadata, while everything else is a DNA sequence. I could make it translate every other line, but for the second part of the final project the file looks like this:

>ENST00000651352.1 cds chromosome:GRCh38:3:3126963:3148101:1 gene:ENSG00000072756.17 gene_biotype:protein_coding transcript_biotype:nonsense_mediated_decay gene_symbol:TRNT1 description:tRNA nucleotidyl transferase 1 [Source:HGNC Symbol;Acc:HGNC:17341]
ATGCTGAGGTGCCTGTATCATTGGCACAGGCCAGTGCTGAACCGTAGGTGGAGTAGGCTG
TGCCTTCCGAAGCAGTATCTATTCACAATGAAGTTGCAGTCTCCCGAATTCCAGTCACTT
TTCACAGAAGGACTGAAGAGTCTGACAGAATTATTTGTCAAAGAGAATCACGAATTAAGA
ATAGCAGGAGGAGCAGTGAGGGATTTATTAAATGGAGTAAAGCCTCAGGATATAGATTTT
GCCACCACTGCTACCCCTACTCAAATGAAGGAGATGTTTCAGTCGGCTGGGATTCGGATG
ATAAACAACAGAGGAGAAAAGCACGGAACAATTACTGCCAGGGTTTTGATGGCACTTTAT
TTGACTACTTTAATGGTTATGAAGATTTAA
>ENST00000434583.5 cds chromosome:GRCh38:3:3126965:3150879:1 gene:ENSG00000072756.17 gene_biotype:protein_coding transcript_biotype:nonsense_mediated_decay gene_symbol:TRNT1 description:tRNA nucleotidyl transferase 1 [Source:HGNC Symbol;Acc:HGNC:17341]

My original solution was to just remove the lines that start with '>', but the output would then not include the metadata that I need. Is it possible to somehow separate the metadata from the DNA, run the DNA through the dictionary, and then put the translated result back between the metadata lines (where the DNA was before it went through the dictionary)?

As stated above, I have tried removing lines that start with '>', but that only works if I don't need the metadata, and I do need it. I have also tried reading only lines that start with 'ATG', since the majority of the DNA strands start with ATG, but at the beginning of the file for the second part of the project, the DNA does not start with ATG for around 100 lines.

import sys
file = open('2dna.fasta' , 'r')

DNASequence = ''
for lines in file.readlines():
    if not (lines.startswith('>')):
        DNASequence = DNASequence + lines
    
DNASequence = DNASequence.replace('\n', '')
print('The original DNA sequence is', DNASequence)

CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
final = ""

for letter in DNASequence:    
    final += CompletmentDict[letter]
    
print ("Your completement is: ", final)

final2 = "" 
    
DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}

for letters in final:
    final2 += DNATORNADICT[letters]

print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)

rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

final3 = ""

for p in range(0,len(final2),3):
    myKey = final2[p:p+3]
    final3 += rna2protein.get(myKey)
    
print("Resulting protein is: ", final3)

proteinSeq = open('proteinSeq.txt', 'w')
proteinSeq.write(final3)
proteinSeq.close()

The output of this is

The original DNA sequence is GGGACAGGGGGCCCTTCCTAC
Your completement is:  CCCTGTCCCCCGGGAAGGATG
Your Final DNA TO RNA TRANSCRIPTION IS: GGGACAGGGGGCCCUUCCUAC
Resulting protein is:  GTGGPSY

and in my results file it looks like

GTGGPSY

but I want it to look like

>ENST00000632684.1
GTGG
>ENST00000434970.2
PSY

How could I do this? If you need any clarification on what any of this means, let me know

CodePudding user response：

I made some adjustments to your code:

import sys
file = open('2dna.fasta' , 'r')
proteinSeq = open('proteinSeq.txt', 'a')

metaData = []
DNAs = []

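for lines in file.readlines():
    if lines.startswith('>'):
        # Header line: start a new record. split(" ")[0] keeps only the ID
        # (the text before the first space); use lines.strip() to keep the whole header.
        metaData.append(lines.split(" ")[0].strip())
        DNAs.append('')
    elif metaData:
        # Sequence line: add it, without its newline, to the current record.
        DNAs[-1] = DNAs[-1] + lines.strip()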

def showData(metaData,DNASequence):
    print('The original DNA sequence is', DNASequence)

    CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
    final = ""

    for letter in DNASequence:    
        final += CompletmentDict[letter]
        
    print ("Your completement is: ", final)

    final2 = "" 
        
    DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}

    for letters in final:
        final2 += DNATORNADICT[letters]

    print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)

    rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
    'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
    'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
    'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
    'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
    'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
    'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
    'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
    'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
    'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
    'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
    'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
    'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
    'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
    'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

    final3 = ""

    for p in range(0,len(final2),3):
        myKey = final2[p:p+3]
        final3 += rna2protein.get(myKey)
        
    print("Resulting protein is: ", final3)
    # Write the header on one line and the protein on the next, as in the expected output
    proteinSeq.write(metaData.strip() + "\n" + final3 + "\n")

for i in range(len(DNAs)):
    showData(metaData[i],DNAs[i])

proteinSeq.close()
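
As a side note on the two letter-by-letter loops in showData: here is a minimal sketch, assuming the same two mappings as CompletmentDict and DNATORNADICT, of doing both passes with Python's built-in str.maketrans and str.translate. The names COMPLEMENT, TO_RNA and transcribe are made up for this illustration and are not part of the code above.

```python
# Translation tables equivalent to CompletmentDict and DNATORNADICT (assumed mappings)
COMPLEMENT = str.maketrans("ATGC", "TACG")
TO_RNA = str.maketrans("ATGC", "UACG")


def transcribe(dna):
    """Complement the DNA strand, then map the complement to RNA."""
    complement = dna.translate(COMPLEMENT)
    return complement.translate(TO_RNA)


# Second record of 2dna.fasta from the question
print(transcribe("CCTTCCTAC"))  # CCUUCCUAC
```

str.translate handles the whole string in one call, so the intermediate strings no longer need to be built one character at a time.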

CodePudding user response：

I don't have any domain specific knowledge about what you need to accomplish so let me know if I'm misunderstanding anything.

It looks like there is a line of metadata followed by the DNA sequence it relates to. If this is the case, I think it would be helpful to split these into separate entries that you can process sequentially.

Assuming that the > character only appears in front of the metadata lines and nowhere else in the file, you could use this as a delimiter to split the string:



my_sequences = []
with open('myfile.txt' , 'r') as file:
  # split along the metadata delimiter
  for entry in file.read().split('>')[1:]:
    # split the entry at the first newline character to separate the metadata and sequence
    [meta, seq] = (entry.split("\n", 1))
    my_sequences.append({"meta":meta, "seq": seq})

with open('proteinSeq.txt', 'w') as proteinSeq:
  for entry in my_sequences:
    DNASequence = entry["seq"].replace("\n", "")
    print('The original DNA sequence is', DNASequence)

    CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
    final = ""

    for letter in DNASequence:    
        final += CompletmentDict[letter]
    
    print ("Your completement is: ", final)

    final2 = "" 
        
    DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}

    for letters in final:
        final2 += DNATORNADICT[letters]

    print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)

    rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
    'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
    'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
    'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
    'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
    'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
    'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
    'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
    'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
    'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
    'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
    'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
    'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
    'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
    'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

    final3 = ""

    for p in range(0,len(final2),3):
        myKey = final2[p:p+3]
        final3 += rna2protein.get(myKey)
        
    print("Resulting protein is: ", final3)
    proteinSeq.write(f""">{entry["meta"]}\n""")
    proteinSeq.write(f"{final3}\n")
  # no explicit close() is needed: the with block closes the file

You'll end up with a list of dictionaries that you can loop over, where each dictionary has a separate key-value pair for the metadata and for the sequence that follows it.
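
To make the shape of that list concrete, here is a small sketch that runs the same split on the two-record sample from the question, held in a string instead of a file. The helper name split_records is hypothetical, and str.partition is swapped in for split("\n", 1) so that a header with no sequence after it would not raise an error.

```python
# The two-record sample from the question, as a string instead of a file
sample = ">ENST00000632684.1\nGGGACAGGGGGC\n>ENST00000434970.2\nCCTTCCTAC\n"


def split_records(text):
    """Split FASTA text on '>' into a list of {'meta', 'seq'} dictionaries."""
    records = []
    # The piece before the first '>' is empty, so skip it
    for entry in text.split(">")[1:]:
        # Everything up to the first newline is the header; the rest is sequence
        meta, _, seq = entry.partition("\n")
        records.append({"meta": meta, "seq": seq.replace("\n", "")})
    return records


my_sequences = split_records(sample)
for record in my_sequences:
    print(record["meta"], record["seq"])
# ENST00000632684.1 GGGACAGGGGGC
# ENST00000434970.2 CCTTCCTAC
```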

CodePudding user response：

If I understood correctly what you want (based on your expected output), this does the job:

I wrapped the DNA transcoding into a function.
The main part of the code checks for metadata lines; if found, it transcodes the previous (if any) DNA sequence and writes the protein to file, then writes the metadata line and proceeds to the next DNA block.

    def treat_DNA(seq):
        print('The original DNA sequence is', seq)
      
        CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
        final = ""
        for letter in seq:    
            final += CompletmentDict[letter]
        print ("Your completement is: ", final)
      
        final2 = ""   
        DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}
        for letters in final:
            final2 += DNATORNADICT[letters]
        print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)
      
        rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
        'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
        'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
        'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
        'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
        'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
        'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
        'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
        'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
        'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
        'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
        'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
        'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
        'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
        'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
        'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}
        final3 = ""
        for p in range(0,len(final2),3):
            myKey = final2[p:p+3]
            final3 += rna2protein.get(myKey)
        print("Resulting protein is: ", final3)
      
        with open('proteinSeq.txt', 'a') as file:
            file.write(final3 + '\n')
    
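    DNASequence = ''
    with open('2dna.fasta', 'r') as fasta:
        for line in fasta:
            if line.startswith('>'):
                # New metadata line: translate the sequence collected so far
                if DNASequence:
                    treat_DNA(DNASequence)
                DNASequence = ''
                # Copy the metadata line through unchanged
                with open('proteinSeq.txt', 'a') as out:
                    out.write(line)
            else:
                # Sequence line: accumulate it without the newline
                DNASequence += line.strip()
    # Translate the last sequence, which has no metadata line after it
    if DNASequence:
        treat_DNA(DNASequence)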
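
If the file is large, the same idea of flushing the previous sequence when a new '>' line arrives can be written as a generator that yields one (header, sequence) pair at a time. This is only a sketch: read_fasta is a hypothetical helper name, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence lines are joined so multi-line records become one string
            chunks.append(line)
    # The last record has no header after it, so emit it here
    if header is not None:
        yield header, "".join(chunks)


fake_file = io.StringIO(">first some description\nATGCTG\nAGGTGC\n>second\nCCTTCCTAC\n")
for header, sequence in read_fasta(fake_file):
    print(header, sequence)
```

Each pair can then be passed to a function such as treat_DNA without holding the whole file in memory.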

CodePudding user response：

If I understand correctly, you can just keep two versions of the string, one with the metadata and one without (if you actually need the one without), while keeping the newline "\n". Then loop through each line and check whether its first character is ">". If it is, just add the line without running it through the dictionary; if it is not, go through each character in the line.

Also, consider giving your variables better names ;)

This is not working with the examples you gave, but should guide you in the right direction:

# assumes dna (the file contents as one string), DNATORNADICT and rna2protein are already defined
rna = ""

# read through the dna string and replace each character with its rna counterpart
for line in dna.split("\n"):
    if not line.startswith(">"):
        for char in line:
            rna += DNATORNADICT[char]
        rna += "\n"
    else:
        rna += line
        rna += "\n"

protein = ""

# read through the rna string and replace each codon with its protein counterpart
for line in rna.split("\n"):
    if line.startswith(">"):
        protein += line + "\n"
    else:
        for i in range(0, len(line), 3):
            codon = line[i:i+3]
            protein += rna2protein.get(codon) # not sure why you use .get here
        protein += "\n"

print(protein)
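
For comparison, here is a self-contained sketch of the same line-by-line idea that reproduces the output asked for in the question. It rests on a few assumptions that are not in the original code: the names translate_line and convert are made up; the codon table is built from the standard genetic code in U, C, A, G order rather than typed out; the question's two dictionary passes are replaced by a single T to U replacement (which is what they amount to in the question's printed output); stop codons and incomplete trailing codons are dropped. Translating each line separately only keeps the reading frame when every sequence line has a length divisible by 3, as in the examples.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; '*' marks a stop codon
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_line(dna_line):
    """Turn one line of DNA into protein letters, skipping stop codons."""
    rna = dna_line.replace("T", "U")
    protein = ""
    # Stop before any incomplete codon at the end of the line
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE.get(rna[i:i + 3], "")  # unknown codons are skipped
        protein += "" if amino == "*" else amino
    return protein


def convert(text):
    """Copy '>' lines unchanged and translate every other non-empty line."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)
        elif line:
            out.append(translate_line(line))
    return "\n".join(out) + "\n"


sample = ">ENST00000632684.1\nGGGACAGGGGGC\n>ENST00000434970.2\nCCTTCCTAC\n"
print(convert(sample), end="")
# >ENST00000632684.1
# GTGG
# >ENST00000434970.2
# PSY
```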