I'm processing data where I have two sets of numbers, one for the x values and another for the corresponding y ordinates. I need to decimate the data to reduce the number of x values by eliminating duplicates while returning corresponding y values. I need to do this quickly on very large arrays so the code has to be efficient.
The numpy function 'unique' will eliminate duplicates in the x array. However, for each remaining x array ordinate I need to return the maximum y array ordinate for all those that corresponded to that x value. So if these are two example arrays like this:
x = [ 0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
What I need to end up with is:
x = [ 0, 16, 24, 28, 30, 31, 32, 33]
y = [1050, 110, 104, 107, 820, 101, 649, 1020]
All help gratefully appreciated.
CodePudding user response:
After sorting, take out the first index of each unique value, which is used for argument of np.maximum.reduceat
:
>>> x = np.asarray(x)
>>> y = np.asarray(y)
>>> perm = x.argsort()
>>> sort = x[perm]
>>> mask = np.concatenate([[True], sort[1:] != sort[:-1]])
>>> sort[mask]
array([ 0, 16, 24, 28, 30, 31, 32, 33])
>>> np.maximum.reduceat(y[perm], mask.nonzero()[0])
array([1050, 110, 104, 107, 820, 101, 649, 1020])
A simple benchmark for large array of 10 ** 6 size:
In [251]: def mechanic(x, y):
...: x = np.asarray(x)
...: y = np.asarray(y)
...: perm = x.argsort()
...: sort = x[perm]
...: mask = np.concatenate([[True], sort[1:] != sort[:-1]])
...: return sort[mask], np.maximum.reduceat(y[perm], mask.nonzero()[0])
...:
In [252]: def claudio(x, y):
...: xout = []
...: yout = []
...: for g, v in groupby(sorted(zip(x, y)), lambda x: x[0]):
...: xout = [g]
...: yout = [max(v)[1]]
...: return xout, yout
...:
In [253]: def joran_beasley(x, y):
...: df = pd.DataFrame({'x': x, 'y': y})
...: return (*df.groupby('x').agg({'x': 'first', 'y': 'max'}).values.T,)
...:
In [254]: import pandas as pd
In [255]: x, y = np.random.randint(0, 100, (2, 10 ** 6))
In [256]: %timeit mechanic(x, y)
65.6 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [257]: %timeit claudio(x, y)
2.6 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [258]: %timeit joran_beasley(x, y)
36.3 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Start from list:
In [275]: x, y = np.random.randint(0, 100, (2, 10 ** 6)).tolist()
In [276]: %timeit joran_beasley(x, y)
404 ms ± 6.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [277]: %timeit mechanic(x, y)
193 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [278]: %timeit claudio(x, y)
1.02 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A little optimization of @Claudio 's solution:
In [283]: def claudio(x, y):
...: xout = []
...: yout = []
...: firstgetter = itemgetter(0)
...: secondgetter = itemgetter(1)
...: for g, v in groupby(sorted(zip(x, y), key=firstgetter), firstgetter):
...: xout.append(g)
...: yout.append(max(map(secondgetter, v)))
...: return xout, yout
...:
In [284]: %timeit claudio(x, y)
495 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The solution using defaultdict
wins when using lists as input:
In [291]: def defdict_solution(x, y):
...: defdict = defaultdict(list)
...: for k, v in zip(x, y):
...: defdict[k].append(v)
...: lst = [(k, max(v)) for k, v in defdict.items()]
...: lst.sort(key=itemgetter(0))
...: return [k for k, v in lst], [v for k, v in lst]
...:
In [292]: %timeit defdict_solution(x, y)
73.9 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
turn it into a dataframe and groupby and aggregate :)
import pandas
x = [ 0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
df = pandas.DataFrame({'x':x,'y':y})
print(df.groupby('x').agg({'x':'first','y':'max'}))
CodePudding user response:
An approach working on lists as lists are given as data for input and output in the question:
from itertools import groupby
xout = []
yout = []
for g, v in groupby(sorted(zip(xin, yin)), lambda x: x[0]):
xout = [g]
yout = [max(v)[1]]