Here is my code before using multiprocessing. It counts the items in a massive iterator that satisfy a specified condition:
from itertools import permutations

def f(input_):
    if 'AB' in ''.join(input_):
        return True
    else:
        return False

if __name__ == "__main__":
    iterator = permutations(['A', 'B', ...])
    count = 0
    for item in iterator:  # an itertools.permutations object, with str inside
        if f(item):
            count += 1
    print(count)
But the iterator is so massive that I need multiprocessing or multithreading (not sure which is better) to speed the process up.
I have read many online references about multitasking in Python and tried a few approaches, but each one has problems. For example:
from itertools import permutations
from multiprocessing import Pool

def f(input_):
    if 'AB' in ''.join(input_):
        return True
    else:
        return False

if __name__ == "__main__":
    iterator = permutations(['A', 'B', ...])
    pool = Pool()
    result = pool.imap_unordered(f, iterator)
    print(sum(result))
The problem is that this code runs even slower than my original. I have also tried pool.map(), but it is also slower than before and uses up all of my RAM.
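(For context, a common reason a Pool is slower than a plain loop on a cheap per-item task is that each item is sent to a worker process individually. A hedged sketch of the usual mitigation, passing a chunksize so items are shipped in batches; the data and the figure 10_000 are stand-ins to tune, not values from the question:

```python
from itertools import permutations
from multiprocessing import Pool

def f(input_):
    # Same predicate as in the question: does the permutation
    # contain 'A' immediately followed by 'B'?
    return 'AB' in ''.join(input_)

if __name__ == "__main__":
    # Stand-in data small enough to run quickly; the real input is massive.
    iterator = permutations(['A', 'B', 'C', 'D', 'E', 'F'])
    with Pool() as pool:
        # chunksize batches items per IPC round-trip, amortizing overhead.
        count = sum(pool.imap_unordered(f, iterator, chunksize=10_000))
    print(count)
```

Whether this beats the single-process loop still depends on how expensive f is relative to pickling the tuples.)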
What should I do to make this filtering task as fast as possible, using all of my CPU's power? Multiprocessing and multithreading really confuse me. :(
CodePudding user response:
Multiprocessing has a huge overhead compared to itertools.permutations. At the same time, pretty much any problem with permutations can be solved using simple factorials.
Your "huge" iterator can be written as
data = ['A', 'B', 'C', ...]
pattern = ('A', 'B')
sum(pattern in x for x in permutations(data))
That being said, there are factorial(len(data)) possible total permutations. If data has no repeats, then there are factorial(len(data) - len(pattern)) possible arrangements of the items besides those in pattern, and len(data) - len(pattern) + 1 places where pattern can live.
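That counting argument works out to (len(data) - len(pattern) + 1)! matching permutations. A sketch verifying it by brute force on a small example (the five-letter data here is illustrative, not from the question):

```python
from itertools import permutations
from math import factorial

data = ['A', 'B', 'C', 'D', 'E']  # small enough to enumerate fully
pattern = 'AB'

# Brute-force count: permutations containing the contiguous pattern.
brute = sum(pattern in ''.join(p) for p in permutations(data))

# Counting argument: treat the pattern as one block, giving
# (len(data) - len(pattern) + 1)! arrangements.
formula = factorial(len(data) - len(pattern) + 1)

print(brute, formula)  # both are 24
```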
As of Python 3.8, you can do

from math import prod

count = prod(range(2, len(data) - len(pattern) + 2))
For prior versions, you would have to do

from functools import reduce
from operator import mul

count = reduce(mul, range(2, len(data) - len(pattern) + 2), 1)
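Both expressions compute the product 2 · 3 · … · (len(data) - len(pattern) + 1), i.e. the same factorial as above. A quick check on example values (the six-letter data is an assumption for illustration):

```python
from functools import reduce
from operator import mul
from math import factorial

data = ['A', 'B', 'C', 'D', 'E', 'F']  # example data with no repeats
pattern = 'AB'

# reduce-based product, usable before Python 3.8
count = reduce(mul, range(2, len(data) - len(pattern) + 2), 1)

print(count)  # 120, i.e. 5!
assert count == factorial(len(data) - len(pattern) + 1)
```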
For cases where data has repeats that are present in pattern, you can do a Google search for something like "how many permutations will contain a particular sequence" to help you figure out the analytical formula.