How to speed up nested for loop with string values in Python


I want to loop through a dictionary whose values are protein sequences stored as strings of letters. The dictionary could be as large as 200,000 keys.

Using nested for loops, it takes way too long. Below is my current implementation.

import time

def find_msa_diversity(seq_dict):
    """Find the diversity of the MSA"""
    summation = 0
    for n in seq_dict.values():
        count = 0
        for m in seq_dict.values():
            start = time.time()
            if m == n:
                continue
            if seq_similarity(n, m) > 0.8:
                count += 1
                print(f"seq_count: {count}")

This is the seq_similarity function:

def seq_similarity(n, m):
    counter = 0
    for i in range(len(n)):
        if n[i] == m[i]:
            counter += 1
    return 1 if counter / len(n) > 0.8 else 0

I also tried to use np.array here:

import numpy as np

def seq_similarity(n, m):
    similarity_count = (np.array(list(n)) == np.array(list(m))).sum()
    return similarity_count / len(n)

However, the first (plain Python) version actually seems faster than the NumPy one.

How can I optimise find_msa_diversity first and foremost? (I thought about NumPy vectorisation too, but I hear it won't make much of a difference when the values are strings rather than numbers. I am also considering Python multiprocessing.)

I would appreciate your contributions. Thank you.

CodePudding user response:

You're doing redundant, repeated work. Take this example: let's assume you have 3 sequences in your dictionary, S1, S2, S3.

Your algorithm will check:

S1 S1 (You already know this is 1.0 because they are equal)
S1 S2
S1 S3
S2 S1 (You already calculated this previously, it was S1 S2)
S2 S2 (You already know this is 1.0 because they are equal)
S2 S3
S3 S1 (You already calculated this previously, it was S1 S3)
S3 S2 (You already calculated this previously, it was S2 S3)
S3 S3 (You already know this is 1.0 because they are equal)

As you can see, the last loop was entirely pointless. Fix these issues and it should be faster.
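In code, a minimal sketch of the deduplicated loop (the threshold parameter and the returned per-sequence counts are additions for illustration, since the original function never returns anything):

from itertools import combinations

def find_msa_diversity(seq_dict, threshold=0.8):
    """Count, per sequence, how many others exceed `threshold` similarity,
    visiting each unordered pair exactly once."""
    seqs = list(seq_dict.values())
    counts = [0] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        a, b = seqs[i], seqs[j]
        # Fraction of identical positions (sequences assumed equal length).
        if sum(x == y for x, y in zip(a, b)) / len(a) > threshold:
            counts[i] += 1  # the pair counts for both members
            counts[j] += 1
    return counts

This halves the number of similarity comparisons and skips the N self-comparisons entirely.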

This also sounds to me like a clustering problem. You could also use third-party clustering libraries, which likely run efficient C code under the hood and are highly parallelised: https://scikit-learn.org/stable/modules/clustering.html
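If you try the clustering route, here is a rough sketch using scikit-learn's DBSCAN on a precomputed distance matrix (the function name and threshold handling are illustrative; it assumes equal-length ASCII sequences, and a full N-by-N matrix will not fit in memory for 200,000 sequences):

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_sequences(seqs, threshold=0.8):
    # Encode each sequence as a row of byte codes.
    arr = np.array([np.frombuffer(s.encode("ascii"), dtype=np.uint8)
                    for s in seqs])
    # Pairwise fraction of mismatching positions: a distance in [0, 1].
    dist = np.array([(arr != row).mean(axis=1) for row in arr])
    # Sequences within eps = 1 - threshold of each other become neighbours;
    # min_samples=1 means every sequence belongs to some cluster.
    labels = DBSCAN(eps=1 - threshold, min_samples=1,
                    metric="precomputed").fit_predict(dist)
    return labels

Keep in mind DBSCAN links clusters transitively, so two sequences in the same cluster are not necessarily >0.8 similar to each other directly.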

CodePudding user response:

  1. As @oren-revenge mentioned, skip the redundant computations, e.g.:

l = list(seq_dict.values())
for i in range(len(l) - 1):
    for j in range(i + 1, len(l)):
        ...

  2. In the second seq_similarity function, every call incurs unnecessary overhead by casting seq_n/seq_m into np.array. Do the casting once, store the resulting NumPy arrays in a list (say), and refer to them inside seq_similarity; that should speed it up. A sketch of this follows below.
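A minimal sketch of that idea, combining the one-time encoding with row-level NumPy vectorisation (seq_dict is the asker's dictionary; equal-length ASCII sequences and the name count_similar are assumptions):

import numpy as np

# Convert every sequence to a uint8 row once, outside all loops.
mat = np.stack([np.frombuffer(s.encode("ascii"), dtype=np.uint8)
                for s in seq_dict.values()])   # shape (N, L)

def count_similar(threshold=0.8):
    """Per-sequence counts of partners above `threshold` similarity."""
    n = mat.shape[0]
    counts = np.zeros(n, dtype=np.int64)
    for i in range(n - 1):
        # Compare row i against all later rows in one vectorised operation.
        sims = (mat[i] == mat[i + 1:]).mean(axis=1)
        hits = sims > threshold
        counts[i] += np.count_nonzero(hits)
        counts[i + 1:] += hits                 # credit the later rows too
    return counts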