Python filter massive iterator with multiprocessing


Here is my code before using multiprocessing. It counts the items in a massive iterator that satisfy a given condition:

from itertools import permutations

def f(input_):
    if 'AB' in ''.join(input_):
        return True
    else:
        return False

if __name__ == "__main__":
    iterator = permutations(['A', 'B',...])
    count = 0
    for item in iterator:  # it is an itertools.permutations object, with str inside
        if f(item):
            count += 1
    print(count)

But the iterator is so massive that I need multiprocessing or multithreading (not sure which is better) to speed the process up.

I have referred to many online references about multitasking in Python and tried a few approaches. Unfortunately, I still cannot find a solution, since each method I tried has some problem. For example:

from itertools import permutations
from multiprocessing import Pool

def f(input_):
    if 'AB' in ''.join(input_):
        return True
    else:
        return False

if __name__ == "__main__":
    iterator = permutations(['A', 'B',...])
    pool = Pool()
    result = pool.imap_unordered(f, iterator)
    print(sum(result))

In this example, the problem is that this code runs even slower than my original one. I have also tried pool.map(), but it is also slower than before and it uses up all of my RAM.

What should I do to make this filtering task as fast as possible, using all of my CPU? Multiprocessing and multithreading really confuse me. :(

CodePudding user response:

Multiprocessing has a huge per-item overhead compared to the cost of generating each permutation with itertools.permutations. At the same time, pretty much any counting problem over permutations can be solved with simple factorials.
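(If you do still want multiprocessing, most of that overhead is per-item dispatch; passing a large chunksize to imap_unordered batches many items per inter-process round-trip. A minimal sketch, with an assumed eight-letter alphabet and helper names of my own:)

```python
from itertools import permutations
from multiprocessing import Pool

def contains_ab(p):
    # Worker: True if 'AB' occurs contiguously in this permutation.
    return 'AB' in ''.join(p)

def count_matches(data, chunksize=10_000):
    # imap_unordered streams results, so memory stays bounded, and a
    # large chunksize amortizes the per-item IPC cost that made the
    # original pool version slower than the plain loop.
    with Pool() as pool:
        return sum(pool.imap_unordered(contains_ab, permutations(data), chunksize))

if __name__ == "__main__":
    print(count_matches(list('ABCDEFGH')))  # 7! = 5040 for 8 distinct items
```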

Your counting loop can be written as

data = ['A', 'B', 'C', ...]
pattern = 'AB'
count = sum(pattern in ''.join(x) for x in permutations(data))

That being said, there are factorial(len(data)) possible total permutations. If data has no repeats, then there are factorial(len(data) - len(pattern)) possible arrangements of items besides those in pattern, and len(data) - len(pattern) + 1 places where pattern can live.
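As a sanity check on that block-counting argument, brute force agrees with the closed form on an assumed four-item example:

```python
from itertools import permutations
from math import factorial

data = ['A', 'B', 'C', 'D']  # assumed small example, no repeats
pattern = 'AB'

# Brute force: count permutations containing the pattern contiguously.
brute = sum(pattern in ''.join(p) for p in permutations(data))

# Analytical: glue the pattern into a single block, leaving
# len(data) - len(pattern) + 1 items to permute freely.
analytic = factorial(len(data) - len(pattern) + 1)

print(brute, analytic)  # 6 6
```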

As of Python 3.8, you can do

from math import prod

count = prod(range(2, len(data) - len(pattern) + 2))

For prior versions, you would have to do

from functools import reduce
from operator import mul

count = reduce(mul, range(2, len(data) - len(pattern) + 2), 1)
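Both forms compute the same value, namely factorial(len(data) - len(pattern) + 1). A quick check with assumed sizes (10 items, 2-item pattern):

```python
from functools import reduce
from math import factorial, prod
from operator import mul

n, k = 10, 2  # assumed: len(data) and len(pattern)

count_prod = prod(range(2, n - k + 2))              # Python 3.8+
count_reduce = reduce(mul, range(2, n - k + 2), 1)  # earlier versions

print(count_prod, count_reduce, factorial(n - k + 1))  # 362880 for each
```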

For cases where data has repeats that are present in pattern, you can search for something like "how many permutations will contain a particular sequence" to help you figure out the analytical formula.
