Use dask for an out of core conversion of iterable.product into a numpy/dask array (create a matrix-CodePudding

I am looking to create a matrix (numpy array of numpy arrays) of every permutation with repetition (I want to use it for matrix multiplication later on). Currently the way I am doing it, I first create a list of lists then use itertools and then convert to a numpy array of numpy arrays. However as R, the length of each permutation increases the size of the numpy array exponentially increases and causes a memory error. So, I want to generate a the matrix in dask instead. I went through the dask tutorials but haven't worked out how to do this yet.

For example every 5 number combination of the numbers from -1 to 1 (inclusive) using a step size of 0.1 (r = 5, n = 21):

# Create 5 lists each with 21 elements
lst = []
for i in range(0,5):
    lst.append(np.linspace(-1,1,21).tolist())
lst

# Convert to a list of tuples, each tuple is a permutation e.g. -1,-1,-1,-1,-1 or -1,-1,-1,-1,-0.9
lst = list(itertools.product(*lst))
# Convert to a numpy array of numpy arrays for matrix multiplication later on
mat = np.array(lst)

Creating permutations of length 5 is already the maximum my laptop can handle given I am using N = 21. But I already get a memory error when trying to do a length of 6.

I've tried creating a function and using dask delay in together with list comprehension and also dask.array.from_array(), but I am still really new to dask and haven't found the solution yet.

Ideally I would be able to increase the length of the permutations (R) from 5 to somewhere around 10-20 (using the same N = 21 or decreasing it all the way to N = 5), anything above that would be awesome to have but not necessary.

CodePudding user response：

The memory problems is due to this line:

lst = list(itertools.product(*lst))

Without list() this would be a generator, so would not require a lot of memory. Hence, a solution might be to examine the matrix operations downstream and see if they can be performed on subsets of the matrix you are trying to construct (either on blocks or row/column-wise slices). If such subset-operations are possible, then the code can be refactored to work with subsets.

If this is not possible, then the following approach using dask.bags might be helpful:

from dask import compute
from dask.bag import from_sequence

a = from_sequence([1, 2], npartitions=2)
b = from_sequence([3, 6, 9], npartitions=2)

print(*compute(a.product(b)))
# [(1, 3), (1, 6), (2, 3), (2, 6), (1, 9), (2, 9)]

Or closer to your example:

from dask.bag import from_sequence
from numpy import linspace

a = from_sequence(linspace(1, 10, 10), npartitions=2)
b = from_sequence(linspace(20, 30, 10), npartitions=2)
c = a.product(b)
print(c.to_dataframe().to_dask_array(lengths=True))
# dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>

Note that the number of partitions of a.product(b) is a product of the number of partitions of a and b, so you will want to experiment with what is the most meaningful split for your use case.