Deduplicate numpy array by another array

Time:03-18

I have two numpy arrays:

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])

a holds the item indices, and b holds the score of each corresponding item. I want to sort the items by score in descending order while keeping only the highest score for each item. The result should be the deduplicated item indices a_new and their scores b_new.

In the example above, I need:

a_new = np.array([3, 0, 2, 1])
b_new = np.array([1.0, 0.9, 0.8, 0.6])

I know I can do this with scatter_max, but it's a little slow. Is there an easier and faster solution?

Note that I don't want to transform the array to a dictionary, which is a trivial solution. I need a batched solution because I have millions of such arrays.

CodePudding user response:

After sorting both arrays by score in descending order, duplicates can be removed with np.unique, whose return_index option gives the first occurrence of each item:

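import numpy as np

ordering = np.argsort(b)[::-1]                  # indices of the scores, highest first
a = a[ordering]
b = b[ordering]
undup_ind = np.unique(a, return_index=True)[1]  # first (best-scoring) position of each item
keep = np.sort(undup_ind)                       # restore descending-score order
a_new = a[keep]
b_new = b[keep]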

This should be the fastest, or one of the fastest, ways to reach the goal; in my test it ran in 0.5 seconds on 1,000,000 elements.
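Since the question asks for something that works over many arrays, the sort-then-unique idea can be wrapped in a function and mapped over the batch. This is only a minimal sketch: the name dedup_keep_max and the second sample pair are made up for illustration, and a plain Python loop over millions of arrays may still need further tuning.

```python
import numpy as np

def dedup_keep_max(a, b):
    """Return the unique items of `a`, ordered by their best score in `b` (descending)."""
    order = np.argsort(b)[::-1]              # positions of the scores, highest first
    a_sorted, b_sorted = a[order], b[order]
    # np.unique reports the first occurrence of each item; after the sort
    # above, that occurrence carries the item's highest score.
    _, first = np.unique(a_sorted, return_index=True)
    keep = np.sort(first)                    # back to descending-score order
    return a_sorted[keep], b_sorted[keep]

# Hypothetical batch: the pair from the question plus one made-up pair.
batch = [
    (np.array([0, 1, 2, 2, 3]), np.array([0.9, 0.6, 0.5, 0.8, 1.0])),
    (np.array([5, 5, 7]), np.array([0.2, 0.4, 0.3])),
]
results = [dedup_keep_max(a, b) for a, b in batch]

a_new, b_new = results[0]
print(a_new)  # [3 0 2 1]
print(b_new)  # [1.  0.9 0.8 0.6]
```

Each call is independent of the others, so the list comprehension is also the natural place to plug in a process pool if one core is not enough.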

CodePudding user response:

Have you tried pandas?

import numpy as np
import pandas as pd

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])

df = pd.DataFrame({'a': a, 'b': b})  # a dict of columns keeps a as integers; np.stack would cast it to float
df = df.groupby('a')['b'].max().to_frame().reset_index().sort_values(by=['b'], ascending=False)

a_new = df['a'].to_numpy()
b_new = df['b'].to_numpy()

If you want parallel processing, you can explore PySpark, Dask, and the like.
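Before reaching for a cluster, it may be worth trying a single groupby over all arrays at once, with an extra column that says which array each row came from. A rough sketch, assuming the arrays fit in memory together; the batch column name and the second sample pair are my own inventions, not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical batch: the pair from the question plus one made-up pair.
pairs = [
    (np.array([0, 1, 2, 2, 3]), np.array([0.9, 0.6, 0.5, 0.8, 1.0])),
    (np.array([5, 5, 7]), np.array([0.2, 0.4, 0.3])),
]

# Flatten everything into one long table, tagging each row with its batch id.
df = pd.DataFrame({
    "batch": np.concatenate([np.full(len(a), i) for i, (a, _) in enumerate(pairs)]),
    "a": np.concatenate([a for a, _ in pairs]),
    "b": np.concatenate([b for _, b in pairs]),
})

# Best score per (batch, item), then descending score inside each batch.
best = (
    df.groupby(["batch", "a"], as_index=False)["b"].max()
      .sort_values(["batch", "b"], ascending=[True, False])
)

# Split back into one (a_new, b_new) pair per original array.
results = [
    (g["a"].to_numpy(), g["b"].to_numpy())
    for _, g in best.groupby("batch")
]
```

Whether this beats a per-array loop depends on how long the individual arrays are, so it is worth timing both on real data.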
