I have a function that takes a pair of integers (x, y)
and produces a vector with ~3000 elements. So I used:
pool_obj = multiprocessing.Pool()
result = np.array(pool_obj.map(f, RANGE))
where RANGE is the Cartesian product of the two sets of values that x and y may take, respectively.
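Roughly, RANGE is built like this (the bounds here are placeholders for the real value ranges):

import itertools
RANGE = list(itertools.product(range(1000), range(1000)))  # 1000 x 1000 = 1,000,000 (x, y) pairs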
My problem is that all I need is np.sum(result, axis=0), which is only 3000 elements long: I want to sum over all x and y. There are 1000x1000 pairs of (x, y) in total, so this approach creates a huge 1000000x3000 array that exceeds the memory limit.
How can I resolve this?
CodePudding user response:
Example of using a generator for x, y pairs to reduce the input size, while using imap
to reduce the output size (the results are reduced as they come back to the main process):
import multiprocessing as mp
import numpy as np


class yield_xy:
    """
    Generator for x, y pairs that prevents all pairs of x and y from being
    generated at the start of the map call. In this example it would only be
    a million floats, so on the order of 4-8 MB of data, but if x and y are
    bigger (or maybe you have a z) this can dramatically reduce the input
    data size.
    """
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __len__(self):  # map, imap, map_async, starmap etc. want the input size ahead of time
        return len(self._x) * len(self._y)

    def __iter__(self):  # a simple generator only needs to store x and y rather than all x * y pairs
        for x in self._x:
            for y in self._y:
                yield x, y


def task(args):
    x, y = args
    return (np.zeros(3000) + x) * y  # stand-in for the real f(x, y) returning a 3000-element vector


def main():
    x = np.arange(0, 1000)
    y = np.sin(x)
    out = np.zeros(3000)
    with mp.Pool() as pool:
        for result in pool.imap(task, yield_xy(x, y)):
            out += result  # accumulate results as they arrive; only one row is held at a time
    return out


if __name__ == "__main__":
    result = main()
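Since a sum does not depend on the order in which results arrive, an optional further tweak is to use imap_unordered with a chunksize, which usually lowers the inter-process communication overhead. A minimal sketch of the changed loop (the chunksize value is just an example to tune):

    with mp.Pool() as pool:
        for result in pool.imap_unordered(task, yield_xy(x, y), chunksize=64):
            out += result  # order of arrival doesn't matter for a sum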