I'm trying to accelerate my numpy code using dask. The following is a part of my numpy code:
import numpy as np

arr_1 = np.load('<arr1_path>.npy')
arr_2 = np.load('<arr2_path>.npy')
arr_3 = np.load('<arr3_path>.npy')
arr_1 = np.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = np.where(arr_4 == True)    # row/col indices of True entries
[rn, cn] = np.where(arr_4 == False) # row/col indices of False entries
print(len(r))
This prints valid results and is working fine. However, the following dask equivalent
import numpy as np
import dask.array as da

arr_1 = da.from_zarr('<arr1_path>.zarr')
arr_2 = da.from_zarr('<arr2_path>.zarr')
arr_3 = da.from_zarr('<arr3_path>.zarr')
arr_1 = da.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = da.where(arr_4 == True)
[rn, cn] = da.where(arr_4 == False)
print(len(r)) # <----- Error: 'float' object cannot be interpreted as an integer
results in r being
dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>
and thus the above-mentioned error. Since dask arrays are lazily evaluated, do I have to explicitly call compute() or similar somewhere? Or am I missing something basic? Any help will be appreciated.
CodePudding user response:
The array you've constructed with da.where has unknown chunk sizes, which can happen whenever the size of an array depends on lazy computations that haven't yet been performed: dask cannot know how many True entries arr_4 contains until it actually runs the computation. Unknown values within shape or chunks are designated using np.nan rather than an integer, which is why len() fails (in recent dask versions the error message was improved to a clearer ValueError). The solution is to call compute_chunk_sizes:
import numpy as np
import dask.array as da

x = da.from_array(np.random.randn(100), chunks=20)
y = x[x > 0]  # boolean indexing: the result's size depends on the data, so chunks are unknown
# len(y)  # ValueError: Cannot call len() on object with unknown chunk size.
y.compute_chunk_sizes()  # modifies y in-place
len(y)
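Applied to your snippet, the same fix resolves the (nan,) shapes of the index arrays before len() is called. A minimal sketch, picking up right after your da.where lines:

r.compute_chunk_sizes()   # triggers a computation and fills in the chunk sizes in place
rn.compute_chunk_sizes()
print(len(r))             # now works

Note that compute_chunk_sizes() runs a computation of its own, so if all you ultimately need is the number of True entries, something like da.count_nonzero(arr_4).compute() may be cheaper, since it counts without materializing the index arrays at all.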