Python: Separating a txt file into multiple files using a recurring symbol

Time:10-14

I have a .txt file of amino acid sequences, each introduced by a ">NODE" header, like this:

Filename.txt :

>NODE_1 MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
>NODE_2 MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*

I want to split this file into two files (or as many as there are nodes):

Filename1.txt :

>NODE MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*

Filename2.txt :

>NODE MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*

Each output filename should end with a number.

This code works, however it deletes the ">NODE" line and does not create a file for the last node (the one without a '>' afterwards).

with open('/uoa/scratch/shared/Soil_Microbiology_Group/Burkholderiaceae/Zuzanna/OptTemp/Trial_split_genes/CTOTU-1.fasta') as fo:
    op = ''
    start = 0
    cntr = 1
    for x in fo.read().split("\n"):
        if x.startswith('>'):
            if start == 1:
                with open(str(cntr) + '.fasta', 'w') as opf:
                    opf.write(op)
                    opf.close()
                    op = ''
                    cntr += 1
            else:
                start = 1
        else:
            if op == '':
                op = x
            else:
                op = op + '\n' + x
    fo.close()

I can't seem to find the mistake. I would be thankful if you could point it out to me.

Thank you for your help!

CodePudding user response:

This would be a better way. It writes only on odd iterations, so the ">NODE" lines are skipped and files are created only for the actual sequence content. Note that it assumes every record is exactly two lines: a header followed by a single sequence line.

with open('filename.txt') as fo:
    cntr = 1
    for i, content in enumerate(fo.read().split("\n")):
        if i % 2 == 1:
            with open(str(cntr) + '.txt', 'w') as opf:
                opf.write(content)
                cntr += 1

By the way, since you are using a context manager, you don't need to close the file.

Context managers allow you to allocate and release resources precisely when you want to. Here, the with statement opens the file, lets you write data to it, and closes it automatically when the block ends.

Please check: https://book.pythontips.com/en/latest/context_managers.html
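To illustrate the point about closing, here is a minimal sketch that checks the file's `closed` attribute inside and after a `with` block. The scratch path and the sample text are made up for the demonstration; they are not from the question.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")

with open(path, "w") as opf:
    opf.write(">NODE_1\nMSETLV*\n")
    inside = opf.closed  # the file is still open inside the block

after = opf.closed  # leaving the block has closed the file

print(inside, after)
```

Because the file is already closed when the block ends, an explicit `opf.close()` inside or after it is redundant.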

CodePudding user response:

You could do it like this:

content = ['>NODE\n']
fileno = 1


def writefasta():
    global content, fileno
    if len(content) > 1:
        with open(f'Filename{fileno}.fasta', 'w') as fout:
            fout.write(''.join(content))
        content = ['>NODE\n']
        fileno += 1


with open('test.fasta') as fin:
    for line in fin:
        if line.startswith('>NODE'):
            writefasta()
        else:
            content.append(line)
    writefasta()
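The same idea can be sketched without global variables by collecting each record in a generator. This is only one possible variant, not the answer's own code: `split_records` and `write_records` are hypothetical names, the sample lines are shortened placeholders, and unlike the code above it keeps each original header line (for example ">NODE_1") instead of writing a plain ">NODE".

```python
import os
import tempfile


def split_records(lines, marker=">"):
    """Yield each record as a list of lines, header line included."""
    record = []
    for line in lines:
        # A new header ends the record collected so far.
        if line.startswith(marker) and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    # Flush the last record, which has no header after it.
    if record:
        yield record


def write_records(lines, out_dir, prefix="Filename"):
    """Write one numbered .fasta file per record and return the paths."""
    paths = []
    for number, record in enumerate(split_records(lines), start=1):
        path = os.path.join(out_dir, f"{prefix}{number}.fasta")
        with open(path, "w") as fout:
            fout.writelines(record)
        paths.append(path)
    return paths


# Shortened placeholder data standing in for the real file.
sample = [">NODE_1\n", "MSETLV\n", "KFFLGT*\n", ">NODE_2\n", "MSTWHK*\n"]
out_dir = tempfile.mkdtemp()
paths = write_records(sample, out_dir)
```

With a real file, the open file object can be passed directly as `lines`, since iterating over it yields one line at a time.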

CodePudding user response:

    with open('/uoa/scratch/shared/Soil_Microbiology_Group/Burkholderiaceae/Zuzanna/OptTemp/Trial_split_genes/CTOTU-1.fasta') as fo:
        cntr = 1
        # Write every line of the input to its own numbered file.
        for line in fo:
            with open(f'{cntr}.fasta', 'w') as opf:
                opf.write(line)
            cntr += 1