Change ID in multiple FASTA files

Time:09-06

I need to rename multiple sequences in multiple FASTA files, and I found this script, which does it for a single ID:


from Bio import SeqIO

original_file = "./original.fasta"
corrected_file = "./corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original, 'fasta')
    for record in records:
        print(record.id)
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar' # <- Add this line
        print(record.id)
        SeqIO.write(record, corrected, 'fasta')

Each FASTA file corresponds to a single organism, but the organism is not specified in the IDs. I also have the original FASTA files (the files I want to rename are translations of them). They have the same filenames but sit in a different directory, and their IDs include the name of each organism. How can I loop through all of these FASTA files and rename each ID in each file with the corresponding organism name?

CodePudding user response:

OK, here is my attempt. I had to use my own input folders and files, since they were not specified in the question.

/old folder contains files :

MW628877.1.fasta :

>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT

MG995190.1.fasta :

>MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG

and an /empty folder.

/new folder contains files :

MW628877.1.fasta :

>MW628877.1
MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
>KY347969.1
RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR

MG995190.1.fasta :

>MG995190.1
MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD

my code is :

from Bio import SeqIO
from os import scandir

old = './old'
new = './new'

# Map each sequence ID to its organism name (second and third words of the header).
old_ids_dict = {}

for filename in scandir(old):
    if filename.is_file():
        print(filename)
        for seq_record in SeqIO.parse(filename.path, "fasta"):
            old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])

print('_____________________')
print('old ids ---> ', old_ids_dict)
print('_____________________')

for filename in scandir(new):
    if filename.is_file():
        sequences = []
        for seq_record in SeqIO.parse(filename.path, "fasta"):
            if seq_record.id in old_ids_dict:
                print('@@@ ', seq_record.id, '    ', old_ids_dict[seq_record.id])
                # Append the organism name to the existing ID.
                seq_record.id += '.' + old_ids_dict[seq_record.id]
                seq_record.description = ''
                print('-->', seq_record.id)
            print(seq_record)
            sequences.append(seq_record)
        # Overwrite the file in ./new with the renamed records.
        SeqIO.write(sequences, filename.path, 'fasta')

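Check how it works. Note that it overwrites both files in the new folder in place, so work on a copy if you still need the originals.

As @Vovin pointed out in his comment, it needs to be adapted to your own files: the header layout you read the organism name from and the ID format you want to end up with.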
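For example, the slice [1:3] in my code assumes that the genus and species are the second and third words of the header. A minimal sketch of that step as a standalone function, so it is easy to adapt, could look like the following. The name organism_label, its parameters, and the underscore separator are my own assumptions for illustration, not part of the code above.

```python
def organism_label(description, n_words=2, sep="_"):
    """Return the first n_words after the accession, joined by sep.

    Assumes headers look like '<accession> <genus> <species> ...'.
    """
    words = description.split()
    # words[0] is the accession (the record ID); the organism name follows it.
    return sep.join(words[1:1 + n_words])


header = ("MW628877.1 Streptococcus agalactiae strain RYG82 "
          "DNA gyrase subunit A (gyrA) gene, complete cds")
print(organism_label(header))  # prints Streptococcus_agalactiae
```

Joining with an underscore keeps the new ID free of spaces, which matters because many tools treat everything after the first space in a FASTA header as the description rather than the ID.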

I am sure there is more than one way to do this, probably better and more Pythonic than mine; I am learning too. Let us know.
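One possible variation, if you would rather not overwrite the files in ./new, is to write the renamed files to a separate folder. The sketch below is only an illustration under some assumptions: it uses plain text processing with pathlib instead of SeqIO, it only looks at files ending in .fasta, and the function name rename_headers and the output folder are made up for the example.

```python
from pathlib import Path


def rename_headers(src_dir, dst_dir, id_to_name):
    """Copy every .fasta file from src_dir to dst_dir, extending known IDs."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src_file in sorted(Path(src_dir).glob("*.fasta")):
        out_lines = []
        for line in src_file.read_text().splitlines():
            if line.startswith(">"):
                parts = line[1:].split()
                seq_id = parts[0] if parts else ""
                if seq_id in id_to_name:
                    # Replace the header with "<id>.<organism name>".
                    line = f">{seq_id}.{id_to_name[seq_id]}"
            out_lines.append(line)
        # The source file is left untouched; only dst_dir is written to.
        (dst / src_file.name).write_text("\n".join(out_lines) + "\n")
```

With the dictionary built in my code, a call such as rename_headers('./new', './renamed', old_ids_dict) should leave ./new as it is and put the renamed copies in ./renamed.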
