Right now I'm filtering an array using
arr = [a for a in tqdm(replays) if check(a)]
However, with hundreds of thousands of elements this takes a lot of time. I was wondering whether it's possible to do this with multiprocessing, ideally in a nice, compact, Pythonic way.
Thanks!
CodePudding user response:
I ran into the same issue when grouping hundreds of thousands of elements; the solution was to use the itertools module: https://docs.python.org/3/library/itertools.html
Performance improves a lot, but Python still seems to struggle when sorting/grouping/filtering big collections in memory.
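For reference, here is a minimal single-process sketch of that approach using itertools.compress (replays and check are stand-ins for the names in the question); note it only avoids building intermediate lists, it does not parallelise anything:
from itertools import compress

# Stand-in data and predicate for illustration.
replays = list(range(10))
def check(a):
    return a % 2 == 0

# compress() keeps the items whose matching selector value is truthy;
# the selector generator is evaluated lazily, one element at a time.
arr = list(compress(replays, (check(a) for a in replays)))
print(arr)  # [0, 2, 4, 6, 8]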
CodePudding user response:
Define a parallel filter function pfilter that uses multiprocessing:
from multiprocessing import Pool

def pfilter(filter_func, arr, cores):
    with Pool(cores) as p:
        # evaluate the predicate on every element in parallel ...
        booleans = p.map(filter_func, arr)
    # ... then keep only the elements whose predicate was True
    return [x for x, b in zip(arr, booleans) if b]
Because the work is done asynchronously in the pool, the order of execution is truly independent between elements; p.map still returns the results in input order, which is why the zip lines up.
Usage in your case (with 4 CPUs):
arr = pfilter(check, tqdm(replays), 4)
Note, however, that filter_func can't be a lambda expression or a locally defined function: multiprocessing sends the function to the worker processes by pickling it, and lambdas and nested functions can't be pickled.
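If the predicate needs extra arguments, one workaround (a sketch, not part of the answer above) is to keep a module-level function and bind the arguments with functools.partial, which pickles fine:
from functools import partial
from multiprocessing import Pool

# Module-level function: picklable, unlike a lambda or a nested function.
def longer_than(threshold, replay):
    return len(replay) > threshold

if __name__ == "__main__":
    replays = ["a", "abc", "abcdef", "ab"]
    check = partial(longer_than, 2)   # bind the extra argument up front
    with Pool(4) as p:
        flags = p.map(check, replays)
    print([r for r, keep in zip(replays, flags) if keep])  # ['abc', 'abcdef']
Also note that wrapping the input in tqdm as above only tracks how fast p.map consumes the iterable (which happens almost instantly), not the actual work; wrapping p.imap(...) in tqdm inside pfilter gives a more truthful progress bar.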
CodePudding user response:
The concurrent.futures module provides a nice interface for both multithreaded and multiprocess operations.
from concurrent.futures import ProcessPoolExecutor

def check(a):
    return a % 2 == 0

if __name__ == "__main__":
    array = [1, 2, 3, 4, 5]
    with ProcessPoolExecutor(max_workers=3) as ppe:
        res = [a for a, flg in zip(array, ppe.map(check, array)) if flg]
    print(res)  # [2, 4]
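If you also want the progress bar from the question, one possible variant (a sketch; max_workers and chunksize are values you would tune, and replays/check stand in for your real data) wraps the executor's map in tqdm:
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

def check(a):
    return a % 2 == 0

if __name__ == "__main__":
    replays = list(range(100_000))
    with ProcessPoolExecutor(max_workers=4) as ppe:
        # chunksize batches items per task, cutting inter-process overhead;
        # tqdm counts results as they come back, so the bar tracks real progress.
        flags = list(tqdm(ppe.map(check, replays, chunksize=1_000), total=len(replays)))
    arr = [r for r, keep in zip(replays, flags) if keep]
    print(len(arr))  # 50000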