Fastest way to sum the values of the pixels above a threshold in an image with Python

Time:10-28

I am trying to find the best method to retrieve the sum of the pixel values that are bigger than a certain threshold. For example, if my threshold is 253 and I have 10 pixels with value 254 and another 10 with value 255, I expect to get 10*254 + 10*255 = 5090 - a sort of total intensity of the pixels that are above the threshold.

I found a way to do so with np.histogram:

import cv2, time
import numpy as np

threshold = 1
deltaImg = cv2.imread('image.jpg')
t0 = time.time()
# One bin per integer value in [threshold, 256); weight the counts
# by the left bin edges to get the sum of the qualifying pixels.
histogram = np.histogram(deltaImg, 256 - threshold, [threshold, 256])
histoSum = sum(histogram[0] * histogram[1][:-1])
print(histoSum)
print("time = %.2f ms" % ((time.time() - t0) * 1000))

This works and I get the sum of the pixel values that were bigger than the selected threshold. However, I am not sure this is the best/fastest way to do it. Obviously, the higher the threshold, the faster this runs.

Does anyone have an idea how I can get the right result, but with a faster algorithm?

CodePudding user response:

Here you go:

import numpy as np
image = np.random.randint(0,256,(10,10))
threshold = 1
res = np.sum(image[image > threshold])

This operation:

%%timeit
res = np.sum(image[image > threshold])

takes 5.43 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each).
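As a quick sanity check (my addition, not part of the original answer), the mask-and-sum one-liner can be compared against a plain Python loop over the same data:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (10, 10))
threshold = 253

# Boolean mask selects the qualifying pixels; fancy indexing
# extracts them into a 1-D array before summing.
masked = np.sum(image[image > threshold])

# Equivalent plain-Python reference for comparison.
reference = sum(int(v) for v in image.ravel() if v > threshold)
assert masked == reference
```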

CodePudding user response:

While the OP's approach is fundamentally inaccurate, the underlying idea can still be used to craft an approach that is valid for integer arrays (such as grayscale images):

def sum_gt_hist(arr, threshold):
    values = np.arange(threshold, np.max(arr) + 1)
    hist, edges = np.histogram(arr, values + 0.5)
    return sum(values[1:] * hist)

This is, however, non-ideal because it is more complex than it needs to be (np.histogram() is a relatively complex function that computes much more intermediate information than required) and it only works for integer values.
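To see why the half-integer bin edges work (a self-contained check I am adding, repeating the function above): with a threshold of 253 the edges fall at 253.5, 254.5, ..., so each bin counts exactly one integer value strictly above the threshold:

```python
import numpy as np

def sum_gt_hist(arr, threshold):
    # One bin per integer value strictly above the threshold:
    # bin edges at threshold + 0.5, threshold + 1.5, ...
    values = np.arange(threshold, np.max(arr) + 1)
    hist, edges = np.histogram(arr, values + 0.5)
    return sum(values[1:] * hist)

small = np.array([253, 254, 254, 255])
print(sum_gt_hist(small, 253))  # 254 + 254 + 255 = 763
```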

A simpler and still pure NumPy approach was proposed in @sehan2's answer:

import numpy as np


def sum_gt_np(arr, threshold):
    return np.sum(arr[arr > threshold])

While the above would be the preferred NumPy-only solution, much faster execution (and memory efficiency) can be obtained with a simple Numba-based solution:

import numba as nb


@nb.njit
def sum_gt_nb(arr, threshold):
    arr = arr.ravel()
    result = 0
    for x in arr:
        if x > threshold:
            result += x
    return result

Benchmarking the above with a random 100x100 array representing an image, one would get:

import numpy as np


np.random.seed(0)
arr = np.random.randint(0, 256, (100, 100))  # generate a random image
threshold = 253  # set a threshold

funcs = sum_gt_hist, sum_gt_np, sum_gt_nb
for func in funcs:
    print(f"{func.__name__:16s}", end='  ')
    print(func(arr, threshold), end='  ')
    %timeit func(arr, threshold)

# sum_gt_hist       22397  355 µs ± 8.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# sum_gt_np         22397  10.1 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# sum_gt_nb         22397  1.19 µs ± 33.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This indicates that sum_gt_nb() is much faster than sum_gt_np(), which in turn is much faster than sum_gt_hist().
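A middle ground worth mentioning (my addition, not benchmarked in the answer above): NumPy reductions accept a `where=` mask, which avoids the temporary compacted array that fancy indexing allocates, at the cost of still materializing the boolean mask. Whether it actually beats `np.sum(arr[arr > threshold])` depends on the NumPy version and array size, so benchmark before relying on it:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 256, (100, 100))
threshold = 253

# Sum only the elements where the mask is True; no intermediate
# compacted array is allocated, unlike arr[arr > threshold].
res = np.sum(arr, where=arr > threshold)
assert res == np.sum(arr[arr > threshold])
```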
