How to get a sorted cumulative array of values in numpy?

I have the following two numpy arrays (they are actually pandas columns), which represent observations, each with a position and a value:

df['x'] = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
df['y'] = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

From these, I would like to get the following two arrays:

[1 2 3 4 5]
[4 3 2 3 2]

That is, group all items with the same value in df['x'] and get the cumulative sum (the total) of the corresponding values in df['y']. In other words, get the sum of the values at each individual position, with the positions in sorted order.

What is the most straightforward way to achieve that in numpy?

CodePudding user response：

As others have noted in the comments, if you're already using pandas it's probably a good idea to use a sum over groupby. That being said, if you insist on using raw NumPy, you can find the unique values of x together with their inverse indices, and then sum the corresponding values of y into an accumulator array:

import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)
res = np.zeros_like(vals, dtype=y.dtype)
np.add.at(res, inds, y)

print(res)
# [4 3 2 3 2]

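vals holds the sorted unique values in x; here it is only used to size the result (it is also the first of the two arrays you asked for). inds is the key: for each element of x, it gives the index of that element's value in vals. These are the positions in the result where we want to accumulate the corresponding values from y. The last trick is using np.add.at for an unbuffered summation, so that every occurrence of a repeated index is added rather than written only once.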
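To see why the unbuffered call matters, here is a minimal sketch comparing it with plain fancy-index assignment on the same data. The names buffered, unbuffered and via_bincount are made up for this illustration, and np.bincount is an alternative that the answer itself does not use.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)

# Buffered: with repeated indices each slot ends up holding a single
# y value instead of the sum of all of them.
buffered = np.zeros(len(vals), dtype=y.dtype)
buffered[inds] += y

# Unbuffered: every occurrence of a repeated index is accumulated.
unbuffered = np.zeros(len(vals), dtype=y.dtype)
np.add.at(unbuffered, inds, y)

# Alternative (not from the answer): bincount with weights gives the
# same group sums, returned as floats.
via_bincount = np.bincount(inds, weights=y)

print(unbuffered)    # [4 3 2 3 2]
print(via_bincount)  # [4. 3. 2. 3. 2.]
```

On this input the buffered version does not produce the group sums, which is the reason for reaching for np.add.at.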

The result is stored in res.
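For comparison, the pandas route mentioned at the top of this answer could look roughly like the sketch below. It assumes the data lives in a DataFrame with columns x and y as in the question; the names sums, positions and totals are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5]),
    "y": np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2]),
})

# groupby sorts the group keys by default, so the index comes out ordered.
sums = df.groupby("x")["y"].sum()

positions = sums.index.to_numpy()
totals = sums.to_numpy()

print(positions)  # [1 2 3 4 5]
print(totals)     # [4 3 2 3 2]
```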

CodePudding user response：

We can try a sort-based approach:

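import numpy as np

def groupby(a, b):
    # Stable sort of the keys b, so that equal keys become contiguous
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Start of each run of equal keys, plus the end of the array
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])
    # Sum the values of a within each run
    out = [sum(a_sorted[i:j]) for i,j in zip(cut_idx[:-1],cut_idx[1:])]
    return out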


groupby(df['y'].values,df['x'].values)
Out[223]: [4, 3, 2, 3, 2]

Note: for the original function, you can refer to Divakar's answer (thanks again, Divakar, for teaching me numpy :-)).
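The Python-level loop over the cut points can probably be replaced by np.add.reduceat, which sums each run in one call. This is only a sketch of that idea, not the function from the answer referred to above; group_sums is a hypothetical name, and it assumes non-empty 1-D inputs.

```python
import numpy as np

def group_sums(values, keys):
    """Return (sorted unique keys, sum of values per key).

    Assumes non-empty 1-D arrays of equal length.
    """
    # Stable sort so that equal keys become contiguous
    order = keys.argsort(kind="stable")
    keys_sorted = keys[order]
    values_sorted = values[order]
    # Index where each run of equal keys starts
    starts = np.flatnonzero(np.r_[True, keys_sorted[1:] != keys_sorted[:-1]])
    # reduceat sums values_sorted[starts[i]:starts[i+1]] for each run
    return keys_sorted[starts], np.add.reduceat(values_sorted, starts)

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

positions, totals = group_sums(y, x)
print(positions)  # [1 2 3 4 5]
print(totals)     # [4 3 2 3 2]
```

Unlike the list-based version, this returns both requested arrays as NumPy arrays.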