Is there a faster method for iterating over a very big 2D numpy array than using np.where?


I have a huge 2D numpy array filled with integer values, collected from a .tif image via gdal.GetRasterBand(). The pixel values of the image represent unique cluster-identification numbers, so all pixels inside one cluster share the same value. In my script I want to check whether each cluster has more pixels than a specific threshold: if the cluster size is at least the threshold, I want to keep the cluster and give its pixels the value 1; if a cluster has fewer pixels than the threshold, all pixels of that cluster should get the value 0.

My code so far works, but it is very slow, and because I want to vary the threshold it takes forever. I would really appreciate your help. Thank you.

# Import GeoTIFF via GDAL and convert to NumpyArray
data = gdal.Open(image)
raster = data.GetRasterBand(1)
raster = raster.ReadAsArray()

# Different thresholds for iteration
thresh = [0,10,25,50,100,1000,2000]

for threshold in thresh:
    clusteredRaster = np.array(raster.copy(), dtype=int)

    for clump in np.unique(clusteredRaster):  # Unique ids of the clusters in image
        if clusteredRaster[np.where(clusteredRaster == clump)].size >= threshold:
            clusteredRaster[np.where(clusteredRaster == clump)] = int(1)
        else:
            clusteredRaster[np.where(clusteredRaster == clump)] = int(0)

[ClusterImage][1]

In the image you can see the cluster image. Each color stands for a specific cluster number. I want to delete the small clusters (under a specific size) and keep only the big ones.

  [1]: https://i.stack.imgur.com/miEKg.png

CodePudding user response:

There are a number of modifications that can be made to improve performance.

clusteredRaster = np.array(raster.copy(), dtype = int)

can be replaced with

clusteredRaster = raster.astype(int)

astype is essentially both the copy and the cast in a single operation, so it is faster.
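A toy sketch (the 2×3 array below is made up, standing in for the GDAL band) showing that the two forms produce identical, independent copies:

```python
import numpy as np

raster = np.array([[3, 3, 7], [7, 7, 2]], dtype=np.int32)  # stand-in for ReadAsArray()

a = np.array(raster.copy(), dtype=int)  # copy first, then construct/cast (two passes)
b = raster.astype(int)                  # cast straight into a fresh array (one pass)

assert np.array_equal(a, b)  # same values
assert b.base is None        # astype returned an independent copy, not a view
```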

Now, for

clusteredRaster[np.where(clusteredRaster == clump)] = int(1)

you don't need to call np.where at all; boolean-mask indexing works faster:

clusteredRaster[clusteredRaster == clump] = int(1)

The same applies to this part:

clusteredRaster[np.where(clusteredRaster == clump)].size
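A quick toy check (the small array `a` below is made up for illustration) that boolean-mask indexing selects and assigns the same elements as the np.where form:

```python
import numpy as np

a = np.array([5, 2, 5, 9])

# Both index forms pick out the same elements...
assert np.array_equal(a[a == 5], a[np.where(a == 5)])

# ...and assigning through the boolean mask updates them in place.
a[a == 5] = 1
print(a)  # [1 2 1 9]
```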

You can also avoid evaluating clusteredRaster == clump twice, as follows:

for clump in np.unique(clusteredRaster):  # Unique ids of the clusters in image
    indices = clusteredRaster == clump

    if clusteredRaster[indices].size >= threshold:
        clusteredRaster[indices] = 1
    else:
        clusteredRaster[indices] = 0

I think your function will now run about twice as fast. If you want it faster still, use smaller datatypes like np.uint8 instead of plain int, provided your values fit in 8-bit integers (or maybe np.uint16 if 8 bits is too low). That is about as fast as it gets from the Python side.
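To illustrate why dtype matters for huge rasters: an np.uint8 array uses one eighth of the memory of a default 64-bit int array of the same shape (the 4000×4000 shape below is just an illustrative size):

```python
import numpy as np

shape = (4000, 4000)  # illustrative raster size, not from the question
as_int64 = np.zeros(shape, dtype=np.int64)
as_uint8 = np.zeros(shape, dtype=np.uint8)

print(as_int64.nbytes)  # 128000000 bytes
print(as_uint8.nbytes)  # 16000000 bytes
```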

There are faster methods, such as using C modules with OpenMP to multithread the work across multiple cores. This can easily be done with something like numba or cython without having to write C code, but there is a lot of reading to do if you want the fastest performance possible, e.g. which threading backend to use (TBB vs OpenMP) and some OS- and device-dependent capabilities.

CodePudding user response:

In addition to the changes suggested by Ahmed Mohamed AEK, you can also move the calculation of unique values, indices, and counts outside of the for loops. Plus, you don't need to copy raster each time; you can build a fresh array of np.uint8s per threshold.

This gives the same results as your original implementation:

data = gdal.Open(image)
raster = data.GetRasterBand(1).ReadAsArray()

# Different thresholds for iteration
thresh = [0, 10, 25, 50, 100, 1000, 2000]

# determine the unique clumps and their frequencies outside of the for loops
clumps, counts = np.unique(raster, return_counts=True)

# only determine the indices once, rather than for each threshold
indices = np.asarray([raster==clump for clump in clumps])

for threshold in thresh:

    clustered_raster = np.zeros_like(raster, dtype=np.uint8)

    for clump_indices, clump_count in zip(indices, counts):
        clustered_raster[clump_indices] = clump_count >= threshold
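Going one step further, the inner loop can be dropped entirely: collect the cluster ids whose count reaches the threshold, then mark membership in one shot with np.isin. A minimal sketch on a made-up toy raster (not the OP's data):

```python
import numpy as np

raster = np.array([[3, 3, 3, 7],
                   [7, 2, 2, 2],
                   [2, 2, 2, 2]])  # toy cluster ids: 3 appears 3x, 7 appears 2x, 2 appears 7x

clumps, counts = np.unique(raster, return_counts=True)
threshold = 4

# Ids of the clusters big enough to keep...
keep = clumps[counts >= threshold]

# ...then one vectorized membership test gives the 0/1 raster directly.
clustered = np.isin(raster, keep).astype(np.uint8)
print(clustered)
# [[0 0 0 0]
#  [0 1 1 1]
#  [1 1 1 1]]
```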
