Python filter array with multiprocessing


Right now I'm filtering an array using

arr = [a for a in tqdm(replays) if check(a)]

However, with hundreds of thousands of elements this takes a lot of time. I was wondering whether it is possible to do this with multiprocessing, ideally in a nice, compact, Pythonic way.

Thanks!

CodePudding user response:

I was having the same issue when trying to group hundreds of thousands of elements; the solution was to use itertools (https://docs.python.org/3/library/itertools.html).

Performance improves a lot, but it seems Python has some trouble sorting, grouping, or filtering big collections in memory.
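As a rough illustration of the itertools route, itertools.compress can filter lazily without building intermediate lists; the check predicate and replays values below are placeholders standing in for the ones from the question, not code from this answer.

from itertools import compress

def check(replay):
    # placeholder predicate; substitute the real check from the question
    return replay % 2 == 0

replays = range(10)

# compress() keeps each element whose corresponding flag is truthy,
# evaluating lazily instead of materializing extra lists
filtered = list(compress(replays, map(check, replays)))
print(filtered)  # [0, 2, 4, 6, 8]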

CodePudding user response:

Define a parallel filter function pfilter using multiprocessing:

from multiprocessing import Pool

def pfilter(filter_func, arr, cores):
    # materialize the input so it can be iterated twice (by map and by zip)
    arr = list(arr)
    with Pool(cores) as p:
        # Pool.map evaluates the predicate in parallel but keeps input order
        booleans = p.map(filter_func, arr)
        return [x for x, b in zip(arr, booleans) if b]

The pool runs the elements in parallel, so execution order is independent between elements, but Pool.map still returns the results in the original input order, which is why the zip pairs each element with its own boolean.

Usage in your case (with 4 CPUs):

arr = pfilter(check, tqdm(replays), 4)
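Since the original call wraps replays in tqdm, one way to keep a meaningful progress bar is Pool.imap, which yields results lazily and in input order so the bar advances as workers finish; a minimal sketch, assuming replays is a list and check is a module-level predicate:

from multiprocessing import Pool
from tqdm import tqdm

def check(a):
    # placeholder predicate
    return a % 2 == 0

if __name__ == "__main__":
    replays = list(range(100_000))
    with Pool(4) as p:
        # imap yields results in input order; chunksize batches work to cut overhead
        booleans = list(tqdm(p.imap(check, replays, chunksize=1000), total=len(replays)))
    arr = [r for r, b in zip(replays, booleans) if b]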

One caveat: filter_func cannot be a lambda expression (or any locally defined function), because multiprocessing pickles the function to send it to the worker processes, and lambdas and nested functions are not picklable.
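If the predicate needs extra arguments, a common workaround is functools.partial over a module-level function, since partials of picklable functions are themselves picklable; the longer_than function and the threshold value below are hypothetical, used only for illustration:

from functools import partial
from multiprocessing import Pool

def longer_than(threshold, replay):
    # module-level function, so it can be pickled and sent to workers
    return len(replay) > threshold

if __name__ == "__main__":
    replays = ["ab", "abcd", "abcdef"]
    with Pool(2) as p:
        # partial binds threshold=3; the resulting callable is still picklable
        booleans = p.map(partial(longer_than, 3), replays)
    print([r for r, b in zip(replays, booleans) if b])  # ['abcd', 'abcdef']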

CodePudding user response:

The concurrent.futures module provides a nice interface for both multithreaded and multiprocess operations.

from concurrent.futures import ProcessPoolExecutor

def check(a):
    # predicate evaluated in the worker processes
    return a % 2 == 0

if __name__ == "__main__":
    array = [1, 2, 3, 4, 5]

    with ProcessPoolExecutor(max_workers=3) as ppe:
        # ppe.map returns the flags in input order, so zip pairs each
        # element with its own result
        res = [a for a, flg in zip(array, ppe.map(check, array)) if flg]
    print(res)

# [2,4]
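With hundreds of thousands of elements, per-element dispatch overhead can outweigh the work itself; ProcessPoolExecutor.map accepts a chunksize argument that batches elements per submitted task. A sketch under assumed sizes; the chunksize of 1000 is an illustrative guess, not a tuned value:

from concurrent.futures import ProcessPoolExecutor

def check(a):
    return a % 2 == 0

if __name__ == "__main__":
    array = list(range(200_000))

    with ProcessPoolExecutor(max_workers=4) as ppe:
        # chunksize batches elements per task, reducing inter-process overhead
        flags = ppe.map(check, array, chunksize=1000)
        res = [a for a, flg in zip(array, flags) if flg]
    print(len(res))  # 100000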