I'm trying to accelerate my numpy code using dask. The following is a part of my numpy code:
import numpy as np

arr_1 = np.load('<arr1_path>.npy')
arr_2 = np.load('<arr2_path>.npy')
arr_3 = np.load('<arr3_path>.npy')
arr_1 = np.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = np.where(arr_4 == True)    # row/col indices of True entries
[rn, cn] = np.where(arr_4 == False) # row/col indices of False entries
print(len(r))
This prints valid results and is working fine. However, the following dask equivalent
import numpy as np
import dask.array as da

arr_1 = da.from_zarr('<arr1_path>.zarr')
arr_2 = da.from_zarr('<arr2_path>.zarr')
arr_3 = da.from_zarr('<arr3_path>.zarr')
arr_1 = da.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = da.where(arr_4 == True)
[rn, cn] = da.where(arr_4 == False)
print(len(r)) # <----- Error: 'float' object cannot be interpreted as an integer
results in r being
dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>
and thus the above-mentioned error. Since dask arrays are lazily evaluated, do I have to explicitly call compute() or similar somewhere? Or am I missing something basic? Any help will be appreciated.
CodePudding user response:
The array you've constructed with da.where has unknown chunk sizes, which can happen whenever the size of an array depends on lazy computations that haven't yet been performed: dask cannot know how many True entries arr_4 contains until it actually runs the computation. Unknown values within shape or chunks are designated using np.nan rather than an integer, which is why len() fails (in recent dask versions the error message was improved to a clearer ValueError). The solution is to call compute_chunk_sizes:
import numpy as np
import dask.array as da

x = da.from_array(np.random.randn(100), chunks=20)
y = x[x > 0]  # boolean indexing: the result's size depends on the data, so chunks are unknown
# len(y)  # ValueError: Cannot call len() on object with unknown chunk size.
y.compute_chunk_sizes()  # modifies y in-place
len(y)
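Applied to your snippet, the same fix resolves the (nan,) shapes of the index arrays before len() is called. A minimal sketch, picking up right after your da.where lines:

r.compute_chunk_sizes()   # triggers a computation and fills in the chunk sizes in place
rn.compute_chunk_sizes()
print(len(r))             # now works

Note that compute_chunk_sizes() runs a computation of its own, so if all you ultimately need is the number of True entries, something like da.count_nonzero(arr_4).compute() may be cheaper, since it counts without materializing the index arrays at all.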