Binning an array of values to the closest value in a discrete set using Numpy & Numba

I've got a function below which takes an array of floats and an array of discrete integers. For each float, I want it rounded to the closest integer in the list.

The function below works perfectly, where sHatV is an array of 10,000 floats and possible_locations is an array of 5 integers:

binnedV = [min(possible_locations, key=lambda x:abs(x-bv)) for bv in sHatV]

As this function is going to be called thousands of times, I'm trying to use the @numba.njit decorator to minimize computation time.

I thought about using np.digitize in my 'numbafied' function, but it maps out-of-bounds values to the indices 0 and len(bins) rather than clipping them. I want everything to be binned to one of the values in possible_locations.
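For example, a minimal snippet showing np.digitize's out-of-bounds index behavior (the inputs here are made up):

```python
import numpy as np

# made-up values straddling the bin range
x = np.array([-5.0, 2.3, 99.0])
bins = np.array([0, 1, 2, 3, 4, 5])

idx = np.digitize(x, bins)
print(idx)  # [0 3 6] -- below-range maps to 0, above-range to len(bins)
```

Indexing bins by idx directly would raise an IndexError for the value 6, which equals len(bins).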

Overall, I need to write a numba compatible function which takes every value in the first array of length N, finds the closest value to it in array 2, and return that closest value, culminating in an array of length N with the binned values.

Any help is appreciated!

CodePudding user response:

Here's a version that runs much faster and is probably more "numbifiable", since it uses numpy functions instead of the implicit for loop of a list comprehension:

import numpy as np

sHatV = [0.33, 4.18, 2.69]
possible_locations = np.array([0, 1, 2, 3, 4, 5])

diff_matrix = np.subtract.outer(sHatV, possible_locations)
idx = np.abs(diff_matrix).argmin(axis=1)
result = possible_locations[idx]

print(result)
# output: [0 4 3]

The idea here is to calculate a difference matrix between sHatv and possible_locations. In this particular example, that matrix is:

array([[ 0.33, -0.67, -1.67, -2.67, -3.67, -4.67],
       [ 4.18,  3.18,  2.18,  1.18,  0.18, -0.82],
       [ 2.69,  1.69,  0.69, -0.31, -1.31, -2.31]])

Then, with np.abs( ... ).argmin(axis=1), we find the index of each row where the absolute difference is minimal. If we index the original possible_locations array by these indices, we get to the answer.

Comparing the runtime:

using list comprehension

def f(possible_locations, sHatV):
    return [min(possible_locations, key=lambda x:abs(x-bv)) for bv in sHatV]


def test_f():
    possible_locations = np.array([0, 1, 2, 3, 4, 5])
    sHatV = np.random.uniform(0.1, 4.9, size=10_000)
    f(possible_locations, sHatV)


%timeit test_f()
# 187 ms ± 7.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

using difference matrix

def g(possible_locations, sHatV):
    return possible_locations[np.abs(np.subtract.outer(sHatV, possible_locations)).argmin(axis=1)]


def test_g():
    possible_locations = np.array([0, 1, 2, 3, 4, 5])
    sHatV = np.random.uniform(0.1, 4.9, size=10_000)
    g(possible_locations, sHatV)

%timeit test_g()
# 556 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
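As far as I know, ufunc methods like np.subtract.outer are not supported inside numba's nopython mode, so a numba version typically falls back to explicit loops, which numba compiles very well. A minimal sketch (the function name bin_to_closest is my own; the decorator becomes a no-op if numba is not installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # fall back to a no-op decorator if numba is not installed
    def njit(func):
        return func

@njit
def bin_to_closest(values, choices):
    # explicit loops compile well under numba's nopython mode
    out = np.empty(len(values), dtype=choices.dtype)
    for i in range(len(values)):
        best = 0
        best_dist = abs(values[i] - choices[0])
        for j in range(1, len(choices)):
            d = abs(values[i] - choices[j])
            if d < best_dist:
                best_dist = d
                best = j
        out[i] = choices[best]
    return out

result = bin_to_closest(np.array([0.33, 4.18, 2.69]),
                        np.array([0, 1, 2, 3, 4, 5]))
print(result)  # [0 4 3]
```

Note that the first call pays the JIT compilation cost; subsequent calls run at compiled speed.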

CodePudding user response:

I would suggest sticking to numpy for this. The digitize function is close to what you need but requires a bit of modification:

  • implement rounding logic instead of floor/ceil
  • account for endpoint issues. The documentation says: If values in `x` are beyond the bounds of `bins`, 0 or `len(bins)` is returned as appropriate.

Here's an example:

import numpy as np
sHatV = np.array([-99, 1.4999, 1.5, 3.1, 3.9, 99.5, 1000])
bins = np.arange(0,101)

def custom_round(arr, bins):
    # midpoints between consecutive bins act as rounding thresholds
    bin_centers = (bins[:-1] + bins[1:]) / 2
    idx = np.digitize(arr, bin_centers)
    result = bins[idx]
    return result

assert np.all(custom_round(sHatV, bins) == np.array([0, 1, 2, 3, 4, 100, 100]))
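The same midpoint idea generalizes to target values that aren't evenly spaced, such as the original possible_locations, by using np.searchsorted on any sorted array of choices. A hedged sketch (bin_to_closest_sorted is a name I made up):

```python
import numpy as np

def bin_to_closest_sorted(values, choices):
    # choices must be sorted; midpoints between consecutive choices
    # serve as decision boundaries, like bin_centers above
    mid = (choices[:-1] + choices[1:]) / 2
    # side='right' matches np.digitize's default: exact midpoints round up
    idx = np.searchsorted(mid, values, side='right')
    return choices[idx]

rounded = bin_to_closest_sorted(
    np.array([-99, 1.4999, 1.5, 3.1, 99.5, 1000]),
    np.arange(0, 101))
print(rounded)
```

Unlike np.digitize on the raw bins, out-of-range values here land on the first or last choice rather than producing an out-of-bounds index.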

And now my favorite part: how fast is numpy at this? I won't measure scaling; we'll just pick large arrays:

sHatV = 10009*np.random.random(int(1e6))
bins = np.arange(10000)

%timeit custom_round(sHatV, bins)
# on a laptop: 100 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)