This is a follow-up to my previous question, Cumulative sum of a pandas dataframe column without for loop?.
Now, I have a pandas dataframe df that initially has 2 columns: id and a.
e.g., something like
df = {'id': [1, 1, 1, 0, 0, 4, 5, 5], 'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4]}
I want to produce
df = {'id': [1, 1, 1, 0, 0, 4, 5, 5], 'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4], 'b': [0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]}
So it introduces a new column b that contains, for each unique id, the cumulative sum of a over the preceding rows of that id (as shown in the expected output above).
Is there a quick way to do this?
Note that rows with the same id will always be adjacent.
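For reference, here is the example as a runnable setup (assuming pandas is imported as pd):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 0, 0, 4, 5, 5],
                   'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4]})
# desired new column:
# b = [0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]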
CodePudding user response:
You can use GroupBy.apply with pandas.Series.shift:
df["b"] = (
df.groupby("id", group_keys=False)["a"]
.apply(lambda x: x.shift(fill_value=0).cumsum())
)
# Output:
print(df)

   id    a    b
0   1  2.5  0.0
1   1  3.5  2.5
2   1  1.0  6.0
3   0  1.0  0.0
4   0  2.0  1.0
5   4  4.0  0.0
6   5  2.3  0.0
7   5  4.4  2.3
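If you prefer to avoid apply, here is a sketch of an equivalent form using GroupBy.cumsum on the same df: the per-id running total includes the current row, so subtracting a leaves exactly the sum of the previous rows within each id:
# per-id cumulative sum minus the current value of "a" gives the same "b"
df["b"] = df.groupby("id")["a"].cumsum() - df["a"]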
CodePudding user response:
You can use groupby() for this, but have the applied function return the whole group DataFrame instead of a single row:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 0, 0, 4, 5, 5], 'a': [2.5, 3.5, 1.0, 1.0, 2.0, 4.0, 2.3, 4.4]})

def cumsum(df):
    # cumulative sum of the previous "a" values within this group
    df['b'] = df['a'].shift(1).cumsum().fillna(0)
    return df

df2 = df.groupby('id').apply(cumsum)
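Depending on the pandas version, groupby().apply here may add the id as an extra index level on df2; passing group_keys=False (as in the first answer) and sorting the index back gives the original row order. A sketch, assuming the cumsum function defined above:
# keep the original flat index instead of adding "id" as an index level
df2 = df.groupby('id', group_keys=False).apply(cumsum).sort_index()
print(df2['b'].tolist())  # [0.0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]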
Since, as you note, the same id is always in adjacent rows, you can also try this:
import numpy as np

df['id_change'] = df['id'].diff().abs() > 0          # True on the first row of each new id block
df['cumsum'] = df['a'].shift(1).cumsum().fillna(0)   # running total of all previous "a" values
df['base'] = df['cumsum'].where(df['id_change'], np.nan).ffill().fillna(0)  # running total at the start of the block
df['b'] = df['cumsum'] - df['base']
I left all the columns on purpose so it is easier to see what is going on.
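Once b matches the expected values, the helper columns can be dropped again:
# remove the intermediate columns, keeping only id, a and b
df = df.drop(columns=['id_change', 'cumsum', 'base'])
print(df['b'].tolist())  # [0.0, 2.5, 6.0, 0.0, 1.0, 0.0, 0.0, 2.3]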