I have the following numpy arrays (which are actually pandas columns) that represent observations (a position and a value):
df['x'] = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
df['y'] = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])
And instead, I would like to get the following two arrays:
[1 2 3 4 5]
[4 3 2 3 2]
This basically means grouping all items with the same value in df['x'] and summing the corresponding values in df['y'] (in other words, getting the total of the values observed at each individual position).
What is the most straightforward way to achieve that in numpy?
CodePudding user response:
As others have noted in comments, if you're already using pandas it's probably a good idea to use a sum over groupby. That being said, if you insist on using raw NumPy you can find the unique indices of x and then sum up corresponding values in y in an accumulator array:
import numpy as np
x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])
vals, inds = np.unique(x, return_inverse=True)
res = np.zeros_like(vals, dtype=y.dtype)
np.add.at(res, inds, y)
print(res)
# [4 3 2 3 2]
vals are the unique values in x; here they only determine the size of the result. inds is the key: these are the indices of each value of x in vals, which are the positions in the result where we want to accumulate the corresponding values from y. The last trick is using np.add.at for an unbuffered summation, so that repeated indices all contribute. The result is stored in res.
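To see why the unbuffered np.add.at matters, here is a small sketch comparing it with a plain fancy-indexed +=, using the same x and y as above. The names buffered and unbuffered are only illustrative. With repeated indices, the plain += does not accumulate every occurrence, so it gives a different (wrong) result for this task.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)

# Buffered: res[inds] += y is evaluated as res[inds] = res[inds] + y,
# so repeated indices are written several times instead of accumulated.
buffered = np.zeros_like(vals, dtype=y.dtype)
buffered[inds] += y

# Unbuffered: every occurrence of an index adds its value to the slot.
unbuffered = np.zeros_like(vals, dtype=y.dtype)
np.add.at(unbuffered, inds, y)

print(buffered)    # differs from the grouped sums
print(unbuffered)  # [4 3 2 3 2]
```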
CodePudding user response:
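We can try:
import numpy as np

def groupby(a, b):
    # Stable sort by the keys in b so that equal keys end up adjacent
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Boundaries of each run of equal keys, including both ends
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Sum the values of a within each run
    out = [sum(a_sorted[i:j]) for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out

groupby(df['y'].values, df['x'].values)
Out[223]: [4, 3, 2, 3, 2]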
For the original function, you can refer to Divakar's answer (thanks again, Divakar, for teaching me numpy :-)).
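As a possible variation on the same sort-and-cut idea, the Python-level sum over slices could be replaced with np.add.reduceat, which sums each run in one vectorized call. This is only a sketch: groupby_sum is a hypothetical name, it also returns the keys, and it assumes the input arrays are non-empty.

```python
import numpy as np

def groupby_sum(a, b):
    """Sum the values in a for each distinct key in b (hypothetical helper)."""
    # Stable sort by key so that equal keys end up adjacent
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Start index of each run of equal keys (assumes b is non-empty)
    starts = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1]])
    # reduceat sums a_sorted from each start up to the next start
    return b_sorted[starts], np.add.reduceat(a_sorted, starts)

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

keys, sums = groupby_sum(y, x)
print(keys)  # [1 2 3 4 5]
print(sums)  # [4 3 2 3 2]
```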