How to optimize code for a file with 72 million entries?


I have asked this question on another platform (here) as well - it would be great to get your input on making my Python code run much faster. Currently it takes more than 3 hours for a file with millions of entries.

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #print(record.id)
            #print(record.seq)
            #looking for the 3 prime adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #Slice the record itself (not just record.seq) to preserve all its attributes, like the Phred scores, in the FASTQ output
                miRNAseq = record[:adapter_pos]
                adapter_seq = record[adapter_pos:adapter_pos + 19]
                umi_seq = record[adapter_pos + 19:adapter_pos + 19 + 12]
                i = record.id
                x = miRNAseq.seq + umi_seq.seq
                #print((miRNAseq + umi_seq).format("fastq"))
                dicts1[i]=x

        #write ids and seqs in a dictionary; keep only one entry if multiple records share the same miRNA-seq + UMI
        for k,v in dicts1.items():
            if v not in dicts2.values():
                dicts2[k] = v

        #making a list
        for keys in dicts2:
            lst.append(keys)

    #print(dicts1)
    #print(dicts2)
    #print(lst)


    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #based on the saved ids in the list, print the entries (miRNA + 3' adapter + UMI)
            if record.id in lst:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                miRNAseq = record[:adapter_pos]
                adapter_seq = record[adapter_pos:adapter_pos + 19]
                umi_seq = record[adapter_pos + 19:adapter_pos + 19 + 12]
                #print(record.seq)
                #print(miRNAseq.seq)
                #print(adapter_seq.seq)
                #print(umi_seq.seq)
                #print("@" + record.id)
                if len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) <= 50:
                    print((miRNAseq + adapter_seq + umi_seq).format("fastq"), end='')

                if len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) > 50:
                    cut = len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) - 50
                    print((miRNAseq + adapter_seq + umi_seq)[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()

CodePudding user response:

Why not read, parse, and loop over the file only once? I have moved the code of the second loop into the first; am I missing something? Why loop twice?

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    sentinel = 100

    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):

            # only for testing
            if sentinel < 0:
                break
            sentinel -= 1

            #print(record.id)
            #print(record.seq)
            #looking for the 3 prime adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #Slice the record itself (not just record.seq) to preserve all its attributes, like the Phred scores, in the FASTQ output
                miRNAseq = record[:adapter_pos]
                adapter_seq = record[adapter_pos:adapter_pos + 19]
                umi_seq = record[adapter_pos + 19:adapter_pos + 19 + 12]
                i = record.id
                x = miRNAseq.seq + umi_seq.seq
                #print((miRNAseq + umi_seq).format("fastq"))

                if x not in dicts2:
                    if len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) <= 50:
                        print((miRNAseq + adapter_seq + umi_seq).format("fastq"), end='')

                    if len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) > 50:
                        cut = len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) - 50
                        print((miRNAseq + adapter_seq + umi_seq)[:-cut].format("fastq"), end='')

                dicts1[i]=x
                dicts2[x]=i


if __name__ == '__main__':
    QIAseq_UMI_correction()

Other suggestions:

As mentioned in the comments, you could time the major steps to see where time can be shortened; check out timeit, for example. My suggestion is to time SeqIO.parse and the adapter check (if "AACTGTAGGCACCATCAAT" in record.seq:).

I suspect that a major chunk of the time is spent parsing with SeqIO.parse, so ideally you should parse the file only once.
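
A rough sketch of such timing could look like the following. I use time.perf_counter rather than timeit so the two steps can be accumulated inside a single loop, and the file name reads.fastq is just a placeholder:

import time
from Bio import SeqIO

ADAPTER = "AACTGTAGGCACCATCAAT"
parse_time = 0.0
search_time = 0.0

with open("reads.fastq") as fastq:      # placeholder file name
    records = SeqIO.parse(fastq, "fastq")
    while True:
        t0 = time.perf_counter()
        record = next(records, None)    # time spent parsing one record
        t1 = time.perf_counter()
        if record is None:
            break
        _ = ADAPTER in record.seq       # time spent searching for the adapter
        t2 = time.perf_counter()
        parse_time += t1 - t0
        search_time += t2 - t1

print(f"parsing: {parse_time:.1f}s  adapter search: {search_time:.1f}s")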

A final suggestion is to use a smaller set of records until your code does what you need it to do. I have added a sentinel variable as an example, which breaks out of the loop after roughly 100 records have been processed.

CodePudding user response:

The things that I see on a first pass are these:

First, where you check

if "AACTGTAGGCACCATCAAT" in record.seq:
    adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')

you can use the following to avoid searching through the sequence twice.

adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
if adapter_pos != -1:

The lines:

i = record.id
x = miRNAseq.seq + umi_seq.seq
#print((miRNAseq + umi_seq).format("fastq"))
dicts1[i]=x

can be changed to:

dicts1[record.id] = miRNAseq.seq + umi_seq.seq

Next, the lines:

for k,v in dicts1.items():
    if v not in dicts2.values():
        dicts2[k] = v

can be changed to:

dicts2 = {**dicts1, **dicts2}

However, that change might actually come at a performance cost; I'm not sure. Note also that the merge deduplicates by key rather than by value, so it is not exactly equivalent to the original loop, which keeps only the first id for each distinct sequence.
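
If the intent is to keep one record per unique miRNA + UMI sequence, a set of already-seen values avoids scanning dicts2.values() for every record. A minimal sketch of that idea (the seen set is my addition, not in the original code):

seen = set()
dicts2 = {}
for k, v in dicts1.items():
    key = str(v)           # convert the Seq to str so it is safely hashable
    if key not in seen:    # O(1) set lookup instead of scanning dicts2.values()
        dicts2[k] = v
        seen.add(key)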

Next is that

for keys in dicts2:
    lst.append(keys)

can be deleted and

lst = list(dicts2.keys())

can be added outside of and after the first pass (where you have the commented-out print statements). See also the sketch below.
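
One more detail worth noting (my addition, not in the original answer): since lst is only used for the record.id in lst membership test in the second pass, building a set instead makes each lookup O(1) rather than a linear scan:

dicts2 = {"read1": "ACGT", "read2": "TTGA"}   # toy example standing in for the real dict
ids = set(dicts2)                             # iterating a dict yields its keys
print("read1" in ids)                         # True, found in constant time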

Finally, as @Roeften suggests, you can merge the second chunk of code into the first to avoid going through the whole file twice.

At least some of these suggestions will help, but in reality Python just isn't that fast a language. If you want to do this sort of analysis regularly, you might consider using something faster.
