How to solve this using numpy vectorization


I have a really big input numpy array and a dictionary. The dictionary dictates what the values in the numpy array should be updated to. I can do it using a for loop, but it is very time-consuming. Can I use numpy vectorization to solve this?

Input:

import numpy

arr_to_check = numpy.array([['A', 20], ['B', 100], ['C', 80], ['D', 90], ['E', 100]])  # actual length is ~10^8
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}  # actual length is ~10^3

Expected Result:

[['A', '20'],  # do not change, because 20 < 25 --- max possible for 'A' is 25.
 ['B', '0'],   # change to 0, because 100 > 40 --- max possible for 'B' is 40.
 ['C', '80'],  # do not change, because 80 < 90.
 ['D', '0'],   # change to 0, because 90 > 50 --- max possible for 'D' is 50.
 ['E', '100']] # do not change, because 'E' is not in max_possible.

Here is the loop solution:

for i in range(arr_to_check.shape[0]):
    row = arr_to_check[i]
    if row[0] in max_possible and int(row[1]) > max_possible[row[0]]:
        row[1] = 0

CodePudding user response:

Here is a way to do what you've asked (UPDATED to simplify the code).

A few notes first:

  • numpy arrays must be of homogeneous type, so the numbers in your question will be converted by numpy to strings to match the data type of the labels (if pandas is an option, it allows numeric columns to co-exist with string columns; a minimal sketch follows these notes).
  • Though I have taken the result all the way through to match the original homogeneous data type (string), you can stop early and use the intermediate 1D numerical results if that's all you need.
  • I have used int as the numeric type, and you can change this to float if required.
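As an aside, a minimal pandas sketch of that route (an addition for reference, assuming the same labels and values as above):

import pandas as pd

# Each column keeps its own dtype, so 'value' stays numeric.
df = pd.DataFrame({'label': ['A', 'B', 'C', 'D', 'E'],
                   'value': [20, 100, 80, 90, 100]})
caps = df['label'].map(max_possible)      # NaN where the label has no cap
df.loc[df['value'] > caps, 'value'] = 0   # comparisons with NaN are False, so 'E' is untouched

Back to the all-numpy solution: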
import numpy
arr_to_check = numpy.array([['A', 20],['B', 100],['C', 80],['D', 90], ['E', 100]])
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}
print('arr_to_check:'); print(arr_to_check)

aT = arr_to_check.T
labels = aT[0,:]
values = aT[1,:].astype(int)
print('labels:'); print(labels)
print('values:'); print(values)

for label, value in max_possible.items():
    curMask = (labels == label)
    values[curMask] *= (values[curMask] <= value)
print('values:'); print(values)

aT[1,:] = values
arr_to_check = aT.T
print('arr_to_check:'); print(arr_to_check)

Input:

arr_to_check:
[['A' '20']
 ['B' '100']
 ['C' '80']
 ['D' '90']
 ['E' '100']]

Output:

labels:
['A' 'B' 'C' 'D' 'E']
values:
[ 20 100  80  90 100]
values:
[ 20   0  80   0 100]
arr_to_check:
[['A' '20']
 ['B' '0']
 ['C' '80']
 ['D' '0']
 ['E' '100']]

Explanation:

  • Transpose the input so that we can use vectorized operations directly on the numeric vector (values).
  • Iterate over each key/value pair in max_possible, build a boolean mask of the rows whose label (in labels) matches the key, and multiply the masked values by 0 wherever they exceed that key's cap.
  • Update the original numpy array using values.
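If the Python loop over max_possible itself becomes a bottleneck, the lookup can be vectorized further. A minimal sketch (an addition, not part of the answer above), assuming labels absent from max_possible should never be zeroed:

# Look up each cap once per unique label (~10^3 lookups), then apply a single mask.
uniq, inv = numpy.unique(labels, return_inverse=True)
caps = numpy.array([max_possible.get(u, numpy.inf) for u in uniq])  # inf = no cap
values[values > caps[inv]] = 0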

CodePudding user response:

As others have pointed out, numpy arrays are homogeneous, so your output elements will all be str. If that is OK, you can use apply_along_axis:

t = lambda x: [x[0], 0] if x[0] in max_possible and int(x[1]) > max_possible[x[0]] else x
numpy.apply_along_axis(t, 1, arr_to_check)
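The result (everything coerced to the array's string dtype):

array([['A', '20'],
       ['B', '0'],
       ['C', '80'],
       ['D', '0'],
       ['E', '100']], dtype='<U3')

Note that apply_along_axis still calls t once per row in Python, so it is a convenience rather than true vectorization; the timings in the last answer below bear this out.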

CodePudding user response:

As others have said, you should use only numbers in your numpy array. So you could have your data like this:

import numpy as np

arr_to_check = np.array([[0, 20], [1, 100], [2, 80], [3, 90], [4, 100]])
max_possible = np.array([25, 40, 90, 50, np.inf, 100, 90])

Here I have assumed 'A': 0, 'B': 1, and so on. Note that this way, not only have the strings been replaced by numbers, but the dict has also been replaced by a numpy array where max_possible[i] is the max for the i-th label, facilitating subsequent operations. Labels with no cap (like 'E') get np.inf, so they are never changed.
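For illustration, here is one way (a hypothetical sketch, not part of the original answer) to build that encoding from the original string data:

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G']           # fixed label order
code = {k: i for i, k in enumerate(labels)}            # 'A' -> 0, 'B' -> 1, ...
max_dict = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}
max_possible = np.array([max_dict.get(k, np.inf) for k in labels])  # inf for 'E'
arr_to_check = np.array([[code[k], int(v)] for k, v in
                         [['A', 20], ['B', 100], ['C', 80], ['D', 90], ['E', 100]]])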

Now, you obtain what you want with:

m = max_possible.take(arr_to_check.T[0])                 # cap for each row
m1 = np.array([arr_to_check.T[0],
               np.where(arr_to_check.T[1] > m, 0, arr_to_check.T[1])])
m1.T
  • The 1st line puts in m the max value for each row's key.

  • The 2nd line puts in m1 your keys as the first row, and your values (zeroed wherever they exceed that max) as the second row.

  • The 3rd line transposes back to give your result:

    array([[  0,  20], [  1,   0], [  2,  80], [  3,   0], [  4, 100]])

CodePudding user response:

Running your code:

In [362]: %%timeit arr = arr_to_check.copy()
     ...: for i in range(arr.shape[0]):
     ...:     row = arr[i]
     ...:     if row[0] in max_possible and int(row[1]) > max_possible[row[0]]:
     ...:         row[1] = 0
     ...:         
14.1 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Iterating like this on an array is slower than working with lists, so let's try a pure list solution:

In [372]: alist_to_check = [['A', 20],['B', 100],['C', 80],['D', 90], ['E', 100]]
     ...: max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}

Using a list comprehension with an if/else expression:

In [373]: [[k,0] if k in max_possible and v>max_possible[k] else [k,v] for k,v in alist_to_check]
Out[373]: [['A', 20], ['B', 0], ['C', 80], ['D', 0], ['E', 100]]

In [374]: timeit [[k,0] if k in max_possible and v>max_possible[k] else [k,v] for k,v in alist_to_check]
1.45 µs ± 3.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

One of the answers suggested apply_along_axis. My timing came in at:

In [366]: timeit np.apply_along_axis(t, 1, arr_to_check)
108 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

For a small example the pure list approach is fastest. For a really large case we could probably cast it as a numpy problem that scales better, but I haven't explored those options.

With a structured array

We could turn the list into a structured array. This preserves the string and int dtypes:

In [398]: arr = np.array([tuple(kv) for kv in alist_to_check],'U10,int')

In [399]: arr
Out[399]: 
array([('A',  20), ('B', 100), ('C',  80), ('D',  90), ('E', 100)],
      dtype=[('f0', '<U10'), ('f1', '<i4')])

In [400]: arr['f0']
Out[400]: array(['A', 'B', 'C', 'D', 'E'], dtype='<U10')

In [401]: arr['f1']
Out[401]: array([ 20, 100,  80,  90, 100])

If max_possible is small relative to the list, it could be most efficient to iterate on its items, and set the corresponding elements of the structured array. For example:

def foo(alist):
    arr = np.array([tuple(kv) for kv in alist],'U10,int')
    for k,v in max_possible.items():
        idx = np.nonzero((arr['f0']==k) & (arr['f1']>v))[0]
        arr['f1'][idx] = 0
    return arr

In [395]: foo(alist_to_check)
Out[395]: 
array([('A',  20), ('B',   0), ('C',  80), ('D',   0), ('E', 100)],
      dtype=[('f0', '<U10'), ('f1', '<i4')])

For this sample, the times aren't that great:

In [397]: timeit foo(alist_to_check)
102 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

For a big list:

In [403]: biglist = alist_to_check*10000

In [409]: timeit foo(biglist)
44.1 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [410]: timeit [[k,0] if k in max_possible and v>max_possible[k] else [k,v] for k,v in biglist]
14.8 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Time still isn't that great. However, a big chunk of that is in creating the structured array:

In [411]: timeit arr = np.array([tuple(kv) for kv in biglist],'U10,int')
38.4 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If we already had the structured array, I expect the times to be much better.
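As a sketch of that scenario (mask_only is a hypothetical helper, and I haven't timed it here), the masking step alone would be:

big_arr = np.array([tuple(kv) for kv in biglist], 'U10,int')  # built once, up front

def mask_only(arr):
    # Zero out values over their cap, leaving array construction out of the cost.
    out = arr.copy()
    for k, v in max_possible.items():
        out['f1'][(out['f0'] == k) & (out['f1'] > v)] = 0
    return out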

Curiously, making a pure string dtype array from that biglist takes even longer:

In [412]: timeit np.array(biglist)
74.2 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Still, this does make it clear that for work involving a dict and string matching, lists remain competitive with numpy solutions. numpy is best for purely numeric work.
