Finding the 1D numpy array based on the max value in the second column of a 2D numpy array

I'm currently working on removing some rows (1D arrays) from a 2D array based on the values in one of its columns. The first column may contain different and repeated values; for each repeated value I want to keep only the row with the max value in the second column (this is just an example, the 2D array may be bigger). Here is what I tried:

import numpy as np

arr = np.array([[ 36.06, 209.14],
                [ 36.06, 214.55],
                [ 36.06, 215.91],
                [ 36.06, 225.29],
                [ 41.11, 186.76],
                [ 41.11, 191.79],
                [ 41.11, 197.21],
                [ 41.11, 197.33],
                [ 41.11, 201.19],
                [ 41.11, 206.15],
                [ 50.25, 165.51],
                [ 50.25, 174.32],
                [ 59.03, 148.79]])     

biggest = 0
aux = []
for i in range(arr.shape[0]-1):
    j = i + 1
    if (arr[i][0] == arr[j][0]):
        if (arr[i][1] < arr[j][1] and arr[j][1] > biggest):
            biggest = j
    if (arr[i][0] != arr[j][0]):
        aux.append(arr[biggest])

print(np.array(aux))

#Output = [[ 36.06 225.29]
#          [ 41.11 206.15]
#          [ 50.25 174.32]]

As you can see, I get almost the desired result. My expected result should be something like this:

Output = [[ 36.06 225.29]
          [ 41.11 206.15]
          [ 50.25 174.32]
          [ 59.03 148.79]]

The problem is that the last row is missing, and maybe there is an easier way using numpy built-in functions that I'm overlooking. Thank you in advance!

CodePudding user response：

No reason to reinvent the wheel. Just use pandas.

import pandas as pd

pd.DataFrame(arr).groupby(0, as_index=False).max().to_numpy()

>> array([[ 36.06, 225.29],
          [ 41.11, 206.15],
          [ 50.25, 174.32],
          [ 59.03, 148.79]])
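
One caveat: groupby(...).max() takes the maximum of each column independently. That matches the goal here only because there is a single value column. If the array had more columns and whole rows should be kept, a sketch along these lines, using idxmax to pick the row label per key, might be closer to the intent. The three-column array arr3 is made up for illustration and is not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column data: key, value to maximize, extra payload.
arr3 = np.array([[36.06, 209.14, 1.0],
                 [36.06, 225.29, 2.0],
                 [41.11, 206.15, 9.0],
                 [41.11, 186.76, 4.0]])

df = pd.DataFrame(arr3)
# For each key in column 0, idxmax gives the row label where column 1 is largest.
best_rows = df.groupby(0)[1].idxmax()
# Select those complete rows, so the payload column stays attached to its row.
result = df.loc[best_rows].to_numpy()
print(result)
```

With .max() instead, the second group would report 9.0 in the last column either way, but in general the per-column maxima can come from different rows.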

Alternative

The input seems sorted in both columns, meaning the highest value per key is always the last. If that is the case, or if it can be accomplished by sorting, a plain numpy version is also possible.

# if not already sorted, sort as described above
sorted_array = arr[np.lexsort((arr[:, 1], arr[:, 0]))]
# find the last value per key
keys = sorted_array[:, 0]
ends = np.append(keys[1:] != keys[:-1], True)
# extract rows
result = sorted_array[ends]

If we include the cost of sorting, which is O(n log n), this has a higher computational complexity than the pandas version (assuming the pandas version uses hash tables; I haven't checked). The shape of the data and the quality of the implementation may change the actual runtime.
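
For reuse, the sort-and-take-last idea above could be wrapped in a function. This is only a sketch: the name max_row_per_key and the shuffled sample input are my own, and it assumes a non-empty 2D array with the key in column 0 and the value in column 1.

```python
import numpy as np

def max_row_per_key(arr):
    """For each distinct value in column 0, return the row with the largest column 1."""
    # np.lexsort treats the LAST key as primary: sort by column 0, then by column 1.
    order = np.lexsort((arr[:, 1], arr[:, 0]))
    sorted_array = arr[order]
    keys = sorted_array[:, 0]
    # A row closes its group when the next key differs; the final row always does.
    ends = np.append(keys[1:] != keys[:-1], True)
    return sorted_array[ends]

# Deliberately unsorted input to show that the sort step handles it.
arr = np.array([[41.11, 206.15],
                [36.06, 209.14],
                [59.03, 148.79],
                [36.06, 225.29],
                [41.11, 186.76]])
result = max_row_per_key(arr)
print(result)
```

Appending True at the end is what keeps the last group, which is exactly the row the loop in the question drops.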

CodePudding user response：

One way is to apply np.unique to the first column to find its unique values (note that np.unique returns them in sorted order by default, which happens to match the order in your example). Then, for each unique value, find the index of the maximum in the second column and append that row to your list:

aux = []
for i in np.unique(arr[:, 0]):
    arr_ = arr[arr[:, 0] == i]
    aux.append(arr_[arr_[:, 1].argmax()])

Or, using a preallocated array instead of appending to a list:

uniques_ = np.unique(arr[:, 0])
# [36.06 41.11 50.25 59.03]

result = np.empty((uniques_.shape[0], arr.shape[1]))
for i, j in enumerate(uniques_):
    arr_ = arr[arr[:, 0] == j]
    result[i] = arr_[arr_[:, 1].argmax()]

# result
# [[ 36.06 225.29]
#  [ 41.11 206.15]
#  [ 50.25 174.32]
#  [ 59.03 148.79]]
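
If the Python loop becomes a bottleneck, the same grouping can probably be done without it. The following is a sketch of a different technique than the loop above: it combines np.unique(..., return_inverse=True) with np.maximum.at. It assumes exactly two columns, since it rebuilds each row from the key and its maximum rather than selecting existing rows, and the shortened sample array is just for illustration.

```python
import numpy as np

arr = np.array([[36.06, 209.14],
                [36.06, 225.29],
                [41.11, 206.15],
                [41.11, 186.76],
                [59.03, 148.79]])

# uniques_ is sorted; inverse[i] is the position of arr[i, 0] within uniques_.
uniques_, inverse = np.unique(arr[:, 0], return_inverse=True)
# Start at -inf so that any real value replaces the initial entry.
maxima = np.full(uniques_.shape[0], -np.inf)
# Unbuffered in-place maximum: maxima[k] becomes the max of column 1 over group k.
np.maximum.at(maxima, inverse, arr[:, 1])
# Rebuild the two-column result from keys and their maxima.
result = np.column_stack((uniques_, maxima))
print(result)
```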

To preserve the order in which values first appear in the first column while still using np.unique, suppose we have:

arr = np.array([[ 41.11, 186.76],
                [ 41.11, 191.79],
                [ 41.11, 197.21],
                [ 41.11, 197.33],
                [ 41.11, 201.19],
                [ 41.11, 206.15],
                [ 36.06, 209.14],
                [ 36.06, 214.55],
                [ 36.06, 215.91],
                [ 36.06, 225.29],
                [ 50.25, 165.51],
                [ 50.25, 174.32],
                [ 59.03, 148.79]])

_, idx = np.unique(arr[:, 0], return_index=True)
uniques_ = arr[:, 0][np.sort(idx)]
result = np.empty((uniques_.shape[0], arr.shape[1]))
for i, j in enumerate(uniques_):
    arr_ = arr[arr[:, 0] == j]
    result[i] = arr_[arr_[:, 1].argmax()]

# result
# [[ 41.11 206.15]
#  [ 36.06 225.29]
#  [ 50.25 174.32]
#  [ 59.03 148.79]]