I have two numpy arrays:
a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
a is the index of items, and b is the score of the corresponding items. Now I want to sort these items in descending order by the scores in b while keeping only the largest score for each item. The result should be the de-duplicated item indices a_new and the scores of these items b_new.
In the example above, I need:
a_new = np.array([3, 0, 2, 1])
b_new = np.array([1.0, 0.9, 0.8, 0.6])
I know I can do this with scatter_max, but it's a little slow. Is there an easier and faster solution?
Note that I don't want to transform the array to a dictionary, which is a trivial solution. I need a batched solution because I have millions of such arrays.
CodePudding user response:
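After sorting both arrays by score in descending order (using ordering), duplicate items can be removed with np.unique, whose return_index gives the first occurrence of each item, i.e. the one with its highest score: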
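ordering = np.argsort(b)[::-1]  # indices that sort the scores in descending order
a = a[ordering]
b = b[ordering]
# first occurrence of each item in the sorted arrays carries its highest score
undup_ind = np.unique(a, return_index=True)[1]
keep = np.sort(undup_ind)  # restore the descending-score order
a_new = a[keep]
b_new = b[keep]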
This should be the fastest, or one of the fastest, ways to reach the goal; in my test it ran in 0.5 seconds on 1,000,000 elements.
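The same sort-then-deduplicate idea can also be written without np.unique, as a rough sketch: sort by item and score together with np.lexsort, keep the last entry of each run of equal items, then order the survivors by score. The function name dedup_keep_max is only an illustrative choice, and whether this is any faster will depend on your data, so it is worth timing both.

```python
import numpy as np

def dedup_keep_max(a, b):
    """Return the unique items of `a` with their max score from `b`, highest score first."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Sort by item, then by score within each item (the last key is the primary one).
    order = np.lexsort((b, a))
    a_s, b_s = a[order], b[order]
    # The last entry of each run of equal items holds that item's highest score.
    is_last = np.ones(a_s.size, dtype=bool)
    is_last[:-1] = a_s[1:] != a_s[:-1]
    a_u, b_u = a_s[is_last], b_s[is_last]
    # Order the surviving items by score, highest first.
    by_score = np.argsort(-b_u, kind="stable")
    return a_u[by_score], b_u[by_score]

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
a_new, b_new = dedup_keep_max(a, b)
```

For the arrays in the question this gives the same a_new and b_new as requested there.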
CodePudding user response:
Have you tried pandas?
import numpy as np
import pandas as pd
a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
df = pd.DataFrame({'a': a, 'b': b})  # a dict keeps a as integers; np.stack would cast it to float
df = df.groupby('a')['b'].max().reset_index().sort_values(by='b', ascending=False)
a_new = df['a'].to_numpy()
b_new = df['b'].to_numpy()
If you want parallel processing, you can explore PySpark, Dask, and the like.
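Since the question mentions many such arrays, one possible way to batch the pandas approach is to concatenate them, tag every element with the index of the array it came from, and do a single groupby over both columns. This is only a sketch: the batches list and the batch column name are made up for illustration, and it assumes the concatenated data fits in memory.

```python
import numpy as np
import pandas as pd

# Hypothetical batch: each (a, b) pair stands for one of the many arrays.
batches = [
    (np.array([0, 1, 2, 2, 3]), np.array([0.9, 0.6, 0.5, 0.8, 1.0])),
    (np.array([5, 5, 7]), np.array([0.2, 0.4, 0.1])),
]

# Tag every element with the index of the array it came from.
batch_id = np.concatenate([np.full(len(a), i) for i, (a, _) in enumerate(batches)])
df = pd.DataFrame({
    "batch": batch_id,
    "a": np.concatenate([a for a, _ in batches]),
    "b": np.concatenate([b for _, b in batches]),
})

# One groupby for all arrays: max score per (batch, item),
# then sort by score (descending) within each batch.
out = (
    df.groupby(["batch", "a"], as_index=False)["b"].max()
      .sort_values(["batch", "b"], ascending=[True, False])
)

# Split back into per-array results: {batch index: (a_new, b_new)}.
results = {i: (g["a"].to_numpy(), g["b"].to_numpy()) for i, g in out.groupby("batch")}
```

Here results[0] holds a_new and b_new for the first array, results[1] for the second, and so on.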