Python - Pandas: Create new column that is the aggregate sum of another column's group conditio

Sample Data:

Column A	Column B	Column C
Bill	1	2022-09-01
John	0	2022-09-02
Bill	1	2022-09-04
Bill	0	2022-09-10

I want to create a column where column B is summed within each group of column A, but only over the rows whose column C date is earlier than the current row's date.

What I Want:

Column A	Column B	Column C	NEW COL
Bill	1	2022-09-01	0
John	0	2022-09-02	0
Bill	1	2022-09-04	1
Bill	0	2022-09-10	2

So the new column is the sum of column B grouped on column A, but it is only summing column B for dates in column C that are prior to the current row. Therefore, in the last row above, Bill has a 2 in the NEW COL because column B is summed for all instances prior to 2022-09-10.

I have my groupby for a basic aggregation:

df.groupby('Column A')['Column B'].transform(np.sum)

But this doesn't take the date into account and I am stuck on whether to use a groupby at all or if I need to use a lambda function instead.
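For reference, a minimal sketch that rebuilds the sample data and shows what the plain group sum returns. The frame construction is an assumption based on the table above, and `'sum'` is passed as a string instead of `np.sum`, which should give the same result here without needing NumPy.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed dtypes: int and ISO date strings)
df = pd.DataFrame({
    'Column A': ['Bill', 'John', 'Bill', 'Bill'],
    'Column B': [1, 0, 1, 0],
    'Column C': ['2022-09-01', '2022-09-02', '2022-09-04', '2022-09-10'],
})

# Every row receives the total of its whole group, regardless of the date
group_total = df.groupby('Column A')['Column B'].transform('sum')
print(group_total.tolist())  # [2, 0, 2, 2]
```

Each Bill row gets 2 here, including the first one, which is why a date-aware running total is needed instead.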

CodePudding user response:

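Try this. For each row, it sums column B over the rows of the same group whose date is strictly earlier (quadratic in the number of rows, but it also works with duplicated dates):

 df['NEW COL'] = df.apply(
     lambda r: df.loc[(df['Column A'] == r['Column A'])
                      & (df['Column C'] < r['Column C']), 'Column B'].sum(),
     axis=1)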

CodePudding user response:

You might want a shifted cumsum per group:

df['NEW COL'] = (df
   .sort_values(by='Column C')
   .groupby('Column A')['Column B']
   .transform(lambda s: s.cumsum().shift(fill_value=0))
)

Output:

  Column A  Column B    Column C  NEW COL
0     Bill         1  2022-09-01        0
1     John         0  2022-09-02        0
2     Bill         1  2022-09-04        1
3     Bill         0  2022-09-10        2
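To see why the shift is needed, here is a small self-contained sketch on the sample data. The helper columns `running` and `prior` are illustrative names only, and converting `Column C` with `pd.to_datetime` is an added assumption so that the sort is chronological even when the dates are not ISO-formatted strings.

```python
import pandas as pd

df = pd.DataFrame({
    'Column A': ['Bill', 'John', 'Bill', 'Bill'],
    'Column B': [1, 0, 1, 0],
    'Column C': ['2022-09-01', '2022-09-02', '2022-09-04', '2022-09-10'],
})
df['Column C'] = pd.to_datetime(df['Column C'])  # assumption: parse dates before sorting

grouped = df.sort_values(by='Column C').groupby('Column A')['Column B']

# Running total per group; this still includes the current row
running = grouped.cumsum()

# Shifting down by one row within each group leaves only the earlier rows
prior = grouped.transform(lambda s: s.cumsum().shift(fill_value=0))

# Assignment aligns on the index, so the original row order is preserved
df['running'] = running
df['prior'] = prior
print(df)
```

The `prior` column matches the requested NEW COL. Note that this relies on each date appearing at most once per group: with two rows on the same date, the first of them would be counted in the second one's total.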

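Handling duplicated dates: sum column B per group and date first, then apply the shifted cumsum to those per-date totals and merge the result back.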

df2 = (df
   .groupby(['Column A', 'Column C'], as_index=False)['Column B'].sum()
   .sort_values(by='Column C')
)
    
out = df.merge(df2.join(df2.groupby('Column A', group_keys=False)['Column B']
                           .apply(lambda s: s.cumsum().shift(fill_value=0))
                           .rename('NEW COL')
                       ).drop(columns='Column B'),
               on=['Column A', 'Column C'], how='left'
               )

Output (for an input extended with a duplicated date):

  Column A  Column B    Column C  NEW COL
0     Bill         1  2022-09-01        0
1     John         0  2022-09-02        0
2     Bill         1  2022-09-04        1
3     Bill         1  2022-09-10        2
4     Bill         0  2022-09-10        2
5     Bill         1  2022-09-11        3
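As a possible alternative for the duplicated-date case that avoids the merge, the sketch below takes the group's running total as of the last row of each date and subtracts that date's own sum. This is not the method from the answer, just one way to get the same numbers; the input frame is reconstructed from the output above and is therefore an assumption, as are the intermediate names `cum`, `upto` and `day`.

```python
import pandas as pd

# Input reconstructed from the output table (assumed); note the repeated 2022-09-10
df = pd.DataFrame({
    'Column A': ['Bill', 'John', 'Bill', 'Bill', 'Bill', 'Bill'],
    'Column B': [1, 0, 1, 1, 0, 1],
    'Column C': ['2022-09-01', '2022-09-02', '2022-09-04',
                 '2022-09-10', '2022-09-10', '2022-09-11'],
})

s = df.sort_values(by='Column C')

# Running total per group, in date order
cum = s.groupby('Column A')['Column B'].cumsum()

# Value of that running total at the last row of each (group, date) pair,
# i.e. the sum over all dates up to and including the current date
upto = cum.groupby([s['Column A'], s['Column C']]).transform('last')

# Sum of the current date alone within the group
day = s.groupby(['Column A', 'Column C'])['Column B'].transform('sum')

# Strictly earlier dates only; assignment aligns on the original index
df['NEW COL'] = upto - day
print(df)
```

Rows that share a date get the same value, because the whole date is removed from the running total rather than just the current row.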