Pandas calculate time deltas based on some conditions-CodePudding

Having the following DF of mobile users activity:

    user_id                timestamp  wifi
0         1  2021-11-23 11:00:00.000     1
1         1  2021-11-23 11:01:00.000     1
2         1  2021-11-23 11:02:00.000     1
3         1  2021-11-23 11:10:00.000     1
4         1  2021-11-23 11:11:00.000     0
5         1  2021-11-23 11:22:00.000     0
6         2  2021-11-23 11:40:00.000     1
7         2  2021-11-23 11:41:00.000     1
8         2  2021-11-23 11:42:00.000     1
9         2  2021-11-23 11:43:00.000     0
10        2  2021-11-23 11:44:00.000     0
11        2  2021-11-23 11:48:00.000     0

user_id: Users identification
timestamp: Time of the log
wifi: Boolean wifi or cellular usage

I would like to calculate the time usage for both wifi and cellular connections with the following constraints:

A continues usage is defined by two rows with less than 5 min apart.
A row with no continued event will not be counted.

Result should be as follows: for simplicity, I've populated the time spent columns with a number describing the minutes passed. Actual values should be time delta.

    user_id                timestamp  wifi  wifi_time_spent  cell_time_spent
0         1  2021-11-23 11:00:00.000     1                0                0
1         1  2021-11-23 11:01:00.000     1                1                0
2         1  2021-11-23 11:02:00.000     1                2                0
------------------------------ more then 5 min -----------------------------
3         1  2021-11-23 11:10:00.000     1                2                0 <-- single, not adding.
---------------- only 1 event before switching to cellular  ----------------
4         1  2021-11-23 11:11:00.000     0                2                0
------------------------------ more then 5 min -----------------------------
5         1  2021-11-23 11:22:00.000     0                2                0 <-- single, not adding.
---------------------------------- new user --------------------------------
6         2  2021-11-23 11:40:00.000     1                0                0
7         2  2021-11-23 11:41:00.000     1                1                0
8         2  2021-11-23 11:42:00.000     1                2                0
--------------------------- switching to cellular --------------------------
9         2  2021-11-23 11:43:00.000     0                2                0
10        2  2021-11-23 11:44:00.000     0                2                1
11        2  2021-11-23 11:48:00.000     0                2                5

I've written the following code to mark each 5 minutes session with unique id:

df['timestamp'] = pd.to_datetime(df.timestamp)
df['session_grp'] = df.groupby('user_id').apply(
    lambda x: (x.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])).ngroup()).reset_index(
    drop=True).values.reshape(-1)

And it seems to be working fine:

    user_id           timestamp  wifi  session_grp
0         1 2021-11-23 11:00:00     1            0
1         1 2021-11-23 11:01:00     1            0
2         1 2021-11-23 11:02:00     1            0
3         1 2021-11-23 11:10:00     1            2
4         1 2021-11-23 11:11:00     0            2
5         1 2021-11-23 11:22:00     0            4
6         2 2021-11-23 11:40:00     1            0
7         2 2021-11-23 11:41:00     1            0
8         2 2021-11-23 11:42:00     1            0
9         2 2021-11-23 11:43:00     0            0
10        2 2021-11-23 11:44:00     0            0
11        2 2021-11-23 11:48:00     0            1

But that is it, I'm stuck. Any help would be appreciated.

CodePudding user response：

# convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# groupby 5 minute intervals
df['grp'] = df.groupby('user_id')['timestamp'].diff().dt.seconds.gt(300).cumsum()  
# calc the difference and fill misssing values with 0
df['diff'] = df.groupby(['user_id', 'wifi', 'grp'])['timestamp'].diff().fillna(pd.Timedelta(0))

# use loc to filter the frame and assign the diff value for each slice (i.e., wifi and cell)
df.loc[df['wifi'] == 1, 'wifi_time_spent'] = df.loc[df['wifi'] == 1, 'diff']
df.loc[df['wifi'] == 0, 'cell_time_spent'] = df.loc[df['wifi'] == 0, 'diff']
# drop columns not needed and fill the missing values with 0
df = df.drop(columns=['grp', 'diff']).fillna(pd.Timedelta(0))  

# groupby one more time and calculate the cumsum for each column
df['wifi_time_spent'] = df.groupby('user_id')['wifi_time_spent'].cumsum()
df['cell_time_spent'] = df.groupby('user_id')['cell_time_spent'].cumsum()

Out

    user_id           timestamp  wifi wifi_time_spent cell_time_spent
0         1 2021-11-23 11:00:00     1 0 days 00:00:00 0 days 00:00:00
1         1 2021-11-23 11:01:00     1 0 days 00:01:00 0 days 00:00:00
2         1 2021-11-23 11:02:00     1 0 days 00:02:00 0 days 00:00:00
3         1 2021-11-23 11:10:00     1 0 days 00:02:00 0 days 00:00:00
4         1 2021-11-23 11:11:00     0 0 days 00:02:00 0 days 00:00:00
5         1 2021-11-23 11:22:00     0 0 days 00:02:00 0 days 00:00:00
6         2 2021-11-23 11:40:00     1 0 days 00:00:00 0 days 00:00:00
7         2 2021-11-23 11:41:00     1 0 days 00:01:00 0 days 00:00:00
8         2 2021-11-23 11:42:00     1 0 days 00:02:00 0 days 00:00:00
9         2 2021-11-23 11:43:00     0 0 days 00:02:00 0 days 00:00:00
10        2 2021-11-23 11:44:00     0 0 days 00:02:00 0 days 00:01:00
11        2 2021-11-23 11:48:00     0 0 days 00:02:00 0 days 00:05:00