I have a function that takes a pair of integers (x, y)
and produces a vector with ~3000 elements. So I used:
pool_obj = multiprocessing.Pool()
result = np.array(pool_obj.map(f, RANGE))
where RANGE is the Cartesian product of the two sets of values that x and y may take, respectively.
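Roughly, RANGE is built like this (the bounds here are placeholders for the real value ranges):

import itertools
RANGE = list(itertools.product(range(1000), range(1000)))  # 1000 x 1000 = 1,000,000 (x, y) pairs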
My problem is that all I need is np.sum(result, axis=0), which is only 3000 elements long: I want to sum over all x and y. There are 1000x1000 pairs of (x, y) in total, so this approach creates a huge 1000000x3000 array that exceeds the memory limit.
How can I resolve this?
CodePudding user response:
Example of using a generator for x, y pairs to reduce the input size, while using imap
to reduce the output size (the results are reduced as they come back to the main process):
import multiprocessing as mp
import numpy as np


class yield_xy:
    """
    Generator for x, y pairs that prevents all pairs of x and y from being
    generated at the start of the map call. In this example it would only be
    a million floats, so on the order of 4-8 MB of data, but if x and y are
    bigger (or maybe you have a z) this can dramatically reduce the input
    data size.
    """
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __len__(self):  # map, imap, map_async, starmap etc. want the input size ahead of time
        return len(self._x) * len(self._y)

    def __iter__(self):  # a simple generator only needs to store x and y rather than all x * y pairs
        for x in self._x:
            for y in self._y:
                yield x, y


def task(args):
    x, y = args
    return (np.zeros(3000) + x) * y  # stand-in for the real f(x, y) returning a 3000-element vector


def main():
    x = np.arange(0, 1000)
    y = np.sin(x)
    out = np.zeros(3000)
    with mp.Pool() as pool:
        for result in pool.imap(task, yield_xy(x, y)):
            out += result  # accumulate results as they arrive; only one row is held at a time
    return out


if __name__ == "__main__":
    result = main()
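Since a sum does not depend on the order in which results arrive, an optional further tweak is to use imap_unordered with a chunksize, which usually lowers the inter-process communication overhead. A minimal sketch of the changed loop (the chunksize value is just an example to tune):

    with mp.Pool() as pool:
        for result in pool.imap_unordered(task, yield_xy(x, y), chunksize=64):
            out += result  # order of arrival doesn't matter for a sum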