Home > Net >  Get cumulative sum in pandas
Get cumulative sum in pandas

Time:03-17

Context

Datetime Campaign_name Status Open_time
2022-03-15 00:00 Funny_campaign Open
2022-03-15 01:00 Funny_campaign Continue
2022-03-15 02:00 Funny_campaign Continue
2022-03-15 03:00 Funny_campaign Continue
2022-03-15 04:00 Funny_campaign Close
2022-03-15 08:00 Funny_campaign Open
2022-03-15 09:00 Funny_campaign Continue
2022-03-15 10:00 Funny_campaign Close

Problem

I need to calculate the time from open to close.

My code right now

There are two approches I could go with. Get the open time in each 'Close' or a cumulative open_time in each 'Open' and 'Continue'. Here is my take on the last one.

My code right now is almost fine, it doesn't count the time between Close and Open but it forgets to sum the last time difference.

df["Datetime"] = pd.to_datetime(df["Datetime"])
df["time_diff"] = df["Datetime"].diff()
df["time_diff"] = df["time_diff"].astype("timedelta64[m]").fillna(0)
condition = df["Status"] == "Close"
df.loc[condition, "time_diff"] = 0
df["Cumulative time"] = df.groupby(["Campaign_name"])["time_diff"].cumsum()
df = df.drop("time_diff", 1)

CodePudding user response:

IIUC, you could start new groups on the opens and use:

df['Datetime'] = pd.to_datetime(df['Datetime'])

group = df['Status'].eq('Open').cumsum()

df['Open_time'] = df.groupby(group)['Datetime'].apply(lambda g: g-g.iloc[0])
# or, alternative syntax
# df['Open_time'] = df.groupby(group)['Datetime'].apply(lambda g: g.diff().cumsum())

Output:

             Datetime   Campaign_name    Status       Open_time
0 2022-03-15 00:00:00  Funny_campaign      Open 0 days 00:00:00
1 2022-03-15 01:00:00  Funny_campaign  Continue 0 days 01:00:00
2 2022-03-15 02:00:00  Funny_campaign  Continue 0 days 02:00:00
3 2022-03-15 03:00:00  Funny_campaign  Continue 0 days 03:00:00
4 2022-03-15 04:00:00  Funny_campaign     Close 0 days 04:00:00
5 2022-03-15 08:00:00  Funny_campaign      Open 0 days 00:00:00
6 2022-03-15 09:00:00  Funny_campaign  Continue 0 days 01:00:00
7 2022-03-15 10:00:00  Funny_campaign     Close 0 days 02:00:00

Or to only assign to "Close":

df.loc[df['Status'].eq('Close'), 'Open_time'] = df.groupby(group)['Datetime'].apply(lambda g: g-g.iloc[0])

Output:

             Datetime   Campaign_name    Status        Open_time
0 2022-03-15 00:00:00  Funny_campaign      Open              NaN
1 2022-03-15 01:00:00  Funny_campaign  Continue              NaN
2 2022-03-15 02:00:00  Funny_campaign  Continue              NaN
3 2022-03-15 03:00:00  Funny_campaign  Continue              NaN
4 2022-03-15 04:00:00  Funny_campaign     Close  0 days 04:00:00
5 2022-03-15 08:00:00  Funny_campaign      Open              NaN
6 2022-03-15 09:00:00  Funny_campaign  Continue              NaN
7 2022-03-15 10:00:00  Funny_campaign     Close  0 days 02:00:00

And for just the difference close-open for each group:

df.groupby(group)['Datetime'].agg(lambda g: g.iloc[-1]-g.iloc[0])

Output:

Status
1   0 days 04:00:00
2   0 days 02:00:00
Name: Datetime, dtype: timedelta64[ns]
  • Related