I would like to get a cumulative sum of tran_amt for each Cust ID within 24 hours of first transaction. Please see my below example for illustration.
Original Data
DateTime | Tran_amt | Cust_ID |
---|---|---|
1/1/2021 2:00:00 PM | 1000 | c103102 |
1/1/2021 3:00:00 PM | 2000 | c103102 |
1/2/2021 10:00:00 AM | 2000 | c103102 |
1/2/2021 11:00:00 AM | 1000 | c211203 |
1/2/2021 12:00:00 PM | 1000 | c103102 |
1/2/2021 5:00:00 PM | 2000 | c103102 |
1/3/2021 3:00:00 AM | 1000 | c211203 |
Expected Output Data
DateTime | Tran_amt | Cust_ID | First Transaction DateTime | Cumulative_amt | Remark |
---|---|---|---|---|---|
1/1/2021 2:00:00 PM | 1000 | c103102 | 1/1/2021 2:00:00 PM | 1000 | |
1/1/2021 3:00:00 PM | 2000 | c103102 | 1/1/2021 2:00:00 PM | 3000 | |
1/2/2021 10:00:00 AM | 2000 | c103102 | 1/1/2021 2:00:00 PM | 5000 | |
1/2/2021 11:00:00 AM | 1000 | c211203 | 1/2/2021 1:00:00 PM | 1000 | |
1/2/2021 12:00:00 PM | 1000 | c103102 | 1/1/2021 2:00:00 PM | 6000 | |
1/2/2021 5:00:00 PM | 2000 | c103102 | 1/2/2021 5:00:00 PM | 2000 | The tran datetime is exceeding 24 hours of previous first transaction Datetime, and thus the cumulative_amt is reset |
1/3/2021 3:00:00 AM | 1000 | c211203 | 1/2/2021 1:00:00 PM | 2000 |
Hope someone can help me the above question. Thanks a lot.
CodePudding user response:
For example :
import pandas as pd
df = pd.DataFrame({'DateTime': ['1/1/2021 2:00:00 PM', '1/1/2021 3:00:00 PM', '1/2/2021 10:00:00 AM', '1/2/2021 11:00:00 AM', '1/2/2021 12:00:00 PM', '1/2/2021 5:00:00 PM', '1/3/2021 3:00:00 AM'],
'Tran_amt': [1000, 2000, 2000, 1000, 1000, 2000, 1000],
'Cust_ID': ['c103102', 'c103102', 'c103102', 'c211203', 'c103102', 'c103102', 'c211203']})
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.sort_values(by=['Cust_ID', 'DateTime'])
df['First Transaction DateTime'] = df.groupby('Cust_ID')['DateTime'].transform('first')
df['time_diff'] = (df['DateTime'] - df['First Transaction DateTime']).dt.total_seconds() / 3600
df['Cumulative_amt'] = df.groupby('Cust_ID').apply(lambda x: x['Tran_amt'].cumsum().where(x['time_diff'] <= 24, 0)).reset_index(drop=True)
df['Remark'] = df['Cumulative_amt'].where(df['Cumulative_amt'] != 0, '')
CodePudding user response:
You need to use the datetime module and other things to make it work.