I am trying to manually iterate through the chunks of a dask array, one by one, and apply my computation. I understand that a benefit of dask is that it can to do the iteration for me, but my computation is failing (for reasons that I don't think are related to dask) and I want to iterate through manually for the purpose of debugging. How would I do that?
I am imagining something like:
import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))
for chunk in data.iterchunks():
# chunk would contain some information about which chunk I have access to,
# and I could somehow get the data contained in that chunk
chunk_data = get_chunk(chunk)
my_function(chunk_data)
Where the chunk
that I get back has some information about which chunk I am in, and there would also be get the data for that chunk.
CodePudding user response:
Access the data within each chunk using the arr.blocks
property. The BlockView object has an array-like interface, but accessing an element in the BlockView array returns the selected chunk(s) in the original array:
In [11]: data
Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>
In [12]: data.blocks
Out[12]: <dask.array.core.BlockView at 0x1730b2da0>
In [13]: data.blocks.shape
Out[13]: (1, 10, 10)
In [14]: data.blocks[0, 0, 0]
Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>
In [15]: data.blocks[0, 0, 0].compute()
Out[15]:
array([[[14, 5, 24, ..., 25, 20, 6],
[17, 12, 2, ..., 27, 13, 18],
[13, 25, 2, ..., 7, 5, 22],
...,
[12, 22, 26, ..., 15, 4, 11],
[ 0, 26, 28, ..., 22, 14, 4],
[ 9, 21, 14, ..., 15, 18, 21]],
...,
[[ 3, 2, 20, ..., 27, 0, 12],
[21, 17, 7, ..., 23, 3, 23],
[24, 13, 0, ..., 26, 1, 0],
...,
[ 5, 25, 6, ..., 22, 6, 16],
[16, 25, 21, ..., 22, 14, 15],
[ 8, 20, 17, ..., 29, 13, 1]]])
So in your case, you could loop through all blocks with the following:
In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
...: chunk = data.blocks[inds]
...: my_function(chunk)
This will be slow, but it does I think what you're looking for.
CodePudding user response:
Try using data.chunks
instead of data.iterchunks()
.