Sample Data:
Column A | Column B | Column C |
---|---|---|
Bill | 1 | 2022-09-01 |
John | 0 | 2022-09-02 |
Bill | 1 | 2022-09-04 |
Bill | 0 | 2022-09-10 |
I want to create a column where column B is summed within each value of column A, but only over the rows whose date in column C is earlier than the current row's date.
What I Want:
Column A | Column B | Column C | NEW COL |
---|---|---|---|
Bill | 1 | 2022-09-01 | 0 |
John | 0 | 2022-09-02 | 0 |
Bill | 1 | 2022-09-04 | 1 |
Bill | 0 | 2022-09-10 | 2 |
So the new column is the sum of column B grouped on column A, but it is only summing column B for dates in column C that are prior to the current row. Therefore, in the last row above, Bill has a 2 in the NEW COL because column B is summed for all instances prior to 2022-09-10.
I have my groupby for a basic aggregation:
df.groupby('Column A')['Column B'].transform('sum')
But this doesn't take the date into account and I am stuck on whether to use a groupby at all or if I need to use a lambda function instead.
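For reference, here is a minimal sketch that rebuilds the sample data and shows what the basic aggregation returns. The variable name `group_total` is only illustrative, and parsing Column C with `pd.to_datetime` is an assumption about how the dates are stored.

```python
import pandas as pd

# Sample data from the question; Column C is assumed to hold real datetimes
df = pd.DataFrame({
    "Column A": ["Bill", "John", "Bill", "Bill"],
    "Column B": [1, 0, 1, 0],
    "Column C": pd.to_datetime(
        ["2022-09-01", "2022-09-02", "2022-09-04", "2022-09-10"]
    ),
})

# Plain group total: every Bill row gets the same value, regardless of date
group_total = df.groupby("Column A")["Column B"].transform("sum")
print(group_total.tolist())  # [2, 0, 2, 2]
```

Each row receives the full total for its name, whereas the desired column is `[0, 0, 1, 2]`, so the date condition has to come from somewhere else.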
CodePudding user response:
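Try this. For each row, it sums Column B over the rows with the same Column A and a strictly earlier Column C:
df['NEW COL'] = df.apply(
    lambda row: df.loc[(df['Column A'] == row['Column A'])
                       & (df['Column C'] < row['Column C']), 'Column B'].sum(),
    axis=1)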
CodePudding user response:
You might want a shifted cumsum per group:
df['NEW COL'] = (df
    .sort_values(by='Column C')
    .groupby('Column A')['Column B']
    .transform(lambda s: s.cumsum().shift(fill_value=0))
)
Output:
Column A Column B Column C NEW COL
0 Bill 1 2022-09-01 0
1 John 0 2022-09-02 0
2 Bill 1 2022-09-04 1
3 Bill 0 2022-09-10 2
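To see why the shift is needed, the following self-contained sketch computes the inclusive running total first and then the shifted version. The names `ordered`, `running` and `prior` are illustrative only, and the dates are assumed to be parsed with `pd.to_datetime`.

```python
import pandas as pd

df = pd.DataFrame({
    "Column A": ["Bill", "John", "Bill", "Bill"],
    "Column B": [1, 0, 1, 0],
    "Column C": pd.to_datetime(
        ["2022-09-01", "2022-09-02", "2022-09-04", "2022-09-10"]
    ),
})

ordered = df.sort_values(by="Column C")

# Inclusive running total per name: it counts the current row as well
running = ordered.groupby("Column A")["Column B"].cumsum()
print(running.tolist())  # [1, 0, 2, 2]

# Shifting by one row inside each group drops the current row,
# leaving only the total of strictly earlier rows
prior = ordered.groupby("Column A")["Column B"].transform(
    lambda s: s.cumsum().shift(fill_value=0)
)

# Assignment aligns on the index, so the original row order is kept
df["NEW COL"] = prior
print(df["NEW COL"].tolist())  # [0, 0, 1, 2]
```

The `fill_value=0` argument keeps the first row of each group at 0 instead of NaN, which also preserves the integer dtype.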
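Handling duplicated dates: if a name can appear more than once on the same date, aggregate per name and date first so that rows sharing a date get the same value, then merge the result back: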
# Sum Column B per (Column A, Column C) so duplicated dates collapse to one row
df2 = (df
       .groupby(['Column A', 'Column C'], as_index=False)['Column B'].sum()
       .sort_values(by='Column C')
      )
# Shifted cumsum per name on the collapsed rows, then merge back onto df
out = df.merge(df2.join(df2.groupby('Column A', group_keys=False)['Column B']
                           .apply(lambda s: s.cumsum().shift(fill_value=0))
                           .rename('NEW COL')
                        ).drop(columns='Column B'),
               on=['Column A', 'Column C'], how='left'
              )
Output:
Column A Column B Column C NEW COL
0 Bill 1 2022-09-01 0
1 John 0 2022-09-02 0
2 Bill 1 2022-09-04 1
3 Bill 1 2022-09-10 2
4 Bill 0 2022-09-10 2
5 Bill 1 2022-09-11 3
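As a rough illustration of why the extra aggregation matters, the sketch below runs both approaches on the six rows shown in the output above. It is a variant of the code in this answer (it uses `transform` instead of `join` plus `apply`), and the names `naive` and `per_date` are made up for the example.

```python
import pandas as pd

# The six rows from the output table, including two Bill rows on 2022-09-10
df = pd.DataFrame({
    "Column A": ["Bill", "John", "Bill", "Bill", "Bill", "Bill"],
    "Column B": [1, 0, 1, 1, 0, 1],
    "Column C": pd.to_datetime(
        ["2022-09-01", "2022-09-02", "2022-09-04",
         "2022-09-10", "2022-09-10", "2022-09-11"]
    ),
})

# Row-level shifted cumsum: the second 2022-09-10 row also counts the first one
df["naive"] = (
    df.sort_values(by="Column C", kind="stable")
      .groupby("Column A")["Column B"]
      .transform(lambda s: s.cumsum().shift(fill_value=0))
)
print(df["naive"].tolist())  # [0, 0, 1, 2, 3, 3]

# Collapse to one row per (name, date), then take the shifted cumsum per name
per_date = (
    df.groupby(["Column A", "Column C"], as_index=False)["Column B"].sum()
      .sort_values(by="Column C")
)
per_date["NEW COL"] = per_date.groupby("Column A")["Column B"].transform(
    lambda s: s.cumsum().shift(fill_value=0)
)

# A left merge keeps every original row, in its original order
out = df.merge(
    per_date.drop(columns="Column B"),
    on=["Column A", "Column C"],
    how="left",
)
print(out["NEW COL"].tolist())  # [0, 0, 1, 2, 2, 3]
```

Both 2022-09-10 rows end up with 2 in the aggregated version, because only strictly earlier dates contribute to the sum.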