This is a follow-up to my previous question, Cumulative sum of a pandas dataframe column without for loop?.
Now, I have a pandas dataframe df that initially has 2 columns: id and a.
e.g., something like
df = {'id': [1, 1, 1, 0, 0, 4, 5, 5], 'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4]}
I want to produce
df = {'id': [1, 1, 1, 0, 0, 4, 5, 5], 'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4], 'b': [0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]}
So it introduces a new column b that contains, for each unique id, the cumulative sum of a over the preceding rows of that id (as shown in the expected output above).
Is there a quick way to do this?
Note that rows with the same id will always be adjacent.
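For reference, here is the example as a runnable setup (assuming pandas is imported as pd):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 0, 0, 4, 5, 5],
                   'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4]})
# desired new column:
# b = [0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]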
CodePudding user response:
You can use GroupBy.apply with pandas.Series.shift:
df["b"] = (
df.groupby("id", group_keys=False)["a"]
.apply(lambda x: x.shift(fill_value=0).cumsum())
)
# Output:
print(df)

   id    a    b
0   1  2.5  0.0
1   1  3.5  2.5
2   1  1.0  6.0
3   0  1.0  0.0
4   0  2.0  1.0
5   4  4.0  0.0
6   5  2.3  0.0
7   5  4.4  2.3
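If you prefer to avoid apply, here is a sketch of an equivalent form using GroupBy.cumsum on the same df: the per-id running total includes the current row, so subtracting a leaves exactly the sum of the previous rows within each id:
# per-id cumulative sum minus the current value of "a" gives the same "b"
df["b"] = df.groupby("id")["a"].cumsum() - df["a"]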
CodePudding user response:
You can use groupby() for this, but have the applied function return the whole group DataFrame instead of a single row:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 0, 0, 4, 5, 5], 'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4]})

def cumsum(df):
    # cumulative sum of the previous "a" values within this group
    df['b'] = df['a'].shift(1).cumsum().fillna(0)
    return df

df2 = df.groupby('id').apply(cumsum)
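Depending on the pandas version, groupby().apply here may add the id as an extra index level on df2; passing group_keys=False (as in the first answer) and sorting the index back gives the original row order. A sketch, assuming the cumsum function defined above:
# keep the original flat index instead of adding "id" as an index level
df2 = df.groupby('id', group_keys=False).apply(cumsum).sort_index()
print(df2['b'].tolist())  # [0.0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]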
Since, as you note, the same id is always in adjacent rows, you can also try this:
import numpy as np

df['id_change'] = df['id'].diff().abs() > 0          # True on the first row of each new id block
df['cumsum'] = df['a'].shift(1).cumsum().fillna(0)   # running total of all previous "a" values
df['base'] = df['cumsum'].where(df['id_change'], np.nan).ffill().fillna(0)  # running total at the start of the block
df['b'] = df['cumsum'] - df['base']
I left all the columns on purpose so it is easier to see what is going on.
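Once b matches the expected values, the helper columns can be dropped again:
# remove the intermediate columns, keeping only id, a and b
df = df.drop(columns=['id_change', 'cumsum', 'base'])
print(df['b'].tolist())  # [0.0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]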