Maximum values in y numpy array for corresponding unique values in x array-CodePudding

I'm processing data where I have two sets of numbers, one for the x values and another for the corresponding y ordinates. I need to decimate the data to reduce the number of x values by eliminating duplicates while returning corresponding y values. I need to do this quickly on very large arrays so the code has to be efficient.

The numpy function 'unique' will eliminate duplicates in the x array. However, for each remaining x array ordinate I need to return the maximum y array ordinate for all those that corresponded to that x value. So if these are two example arrays like this:

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

What I need to end up with is:

x = [   0,  16,  24,  28,  30,  31,  32,   33]
y = [1050, 110, 104, 107, 820, 101, 649, 1020]

All help gratefully appreciated.

CodePudding user response：

After sorting, take out the first index of each unique value, which is used for argument of np.maximum.reduceat:

>>> x = np.asarray(x)
>>> y = np.asarray(y)
>>> perm = x.argsort()
>>> sort = x[perm]
>>> mask = np.concatenate([[True], sort[1:] != sort[:-1]])
>>> sort[mask]
array([ 0, 16, 24, 28, 30, 31, 32, 33])
>>> np.maximum.reduceat(y[perm], mask.nonzero()[0])
array([1050,  110,  104,  107,  820,  101,  649, 1020])

A simple benchmark for large array of 10 ** 6 size:

In [251]: def mechanic(x, y):
     ...:     x = np.asarray(x)
     ...:     y = np.asarray(y)
     ...:     perm = x.argsort()
     ...:     sort = x[perm]
     ...:     mask = np.concatenate([[True], sort[1:] != sort[:-1]])
     ...:     return sort[mask], np.maximum.reduceat(y[perm], mask.nonzero()[0])
     ...:

In [252]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     for g, v in groupby(sorted(zip(x, y)), lambda x: x[0]):
     ...:         xout  = [g]
     ...:         yout  = [max(v)[1]]
     ...:     return xout, yout
     ...:

In [253]: def joran_beasley(x, y):
     ...:     df = pd.DataFrame({'x': x, 'y': y})
     ...:     return (*df.groupby('x').agg({'x': 'first', 'y': 'max'}).values.T,)
     ...:

In [254]: import pandas as pd

In [255]: x, y = np.random.randint(0, 100, (2, 10 ** 6))

In [256]: %timeit mechanic(x, y)
65.6 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [257]: %timeit claudio(x, y)
2.6 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [258]: %timeit joran_beasley(x, y)
36.3 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Start from list:

In [275]: x, y = np.random.randint(0, 100, (2, 10 ** 6)).tolist()

In [276]: %timeit joran_beasley(x, y)
404 ms ± 6.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [277]: %timeit mechanic(x, y)
193 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [278]: %timeit claudio(x, y)
1.02 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A little optimization of @Claudio 's solution:

In [283]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     firstgetter = itemgetter(0)
     ...:     secondgetter = itemgetter(1)
     ...:     for g, v in groupby(sorted(zip(x, y), key=firstgetter), firstgetter):
     ...:         xout.append(g)
     ...:         yout.append(max(map(secondgetter, v)))
     ...:     return xout, yout
     ...:

In [284]: %timeit claudio(x, y)
495 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The solution using defaultdict wins when using lists as input:

In [291]: def defdict_solution(x, y):
     ...:     defdict = defaultdict(list)
     ...:     for k, v in zip(x, y):
     ...:         defdict[k].append(v)
     ...:     lst = [(k, max(v)) for k, v in defdict.items()]
     ...:     lst.sort(key=itemgetter(0))
     ...:     return [k for k, v in lst], [v for k, v in lst]
     ...:

In [292]: %timeit defdict_solution(x, y)
73.9 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response：

turn it into a dataframe and groupby and aggregate :)

import pandas

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

df = pandas.DataFrame({'x':x,'y':y})
print(df.groupby('x').agg({'x':'first','y':'max'}))

CodePudding user response：

An approach working on lists as lists are given as data for input and output in the question:

from itertools import groupby
xout = []
yout = []
for g, v in groupby(sorted(zip(xin, yin)), lambda x: x[0]):
    xout  = [g]
    yout  = [max(v)[1]]