I have the following DataFrame of mobile user activity:
user_id timestamp wifi
0 1 2021-11-23 11:00:00.000 1
1 1 2021-11-23 11:01:00.000 1
2 1 2021-11-23 11:02:00.000 1
3 1 2021-11-23 11:10:00.000 1
4 1 2021-11-23 11:11:00.000 0
5 1 2021-11-23 11:22:00.000 0
6 2 2021-11-23 11:40:00.000 1
7 2 2021-11-23 11:41:00.000 1
8 2 2021-11-23 11:42:00.000 1
9 2 2021-11-23 11:43:00.000 0
10 2 2021-11-23 11:44:00.000 0
11 2 2021-11-23 11:48:00.000 0
user_id: user identifier
timestamp: time of the log entry
wifi: Boolean flag, 1 for wifi usage, 0 for cellular
I would like to calculate the time usage for both wifi and cellular connections, with the following constraints:
- Continuous usage is defined as two rows less than 5 minutes apart.
- A row with no continuation (no neighboring row within 5 minutes) is not counted.
The result should be as follows. For simplicity, I've populated the time-spent columns with the number of minutes passed; the actual values should be timedeltas.
user_id timestamp wifi wifi_time_spent cell_time_spent
0 1 2021-11-23 11:00:00.000 1 0 0
1 1 2021-11-23 11:01:00.000 1 1 0
2 1 2021-11-23 11:02:00.000 1 2 0
------------------------------ more than 5 min -----------------------------
3 1 2021-11-23 11:10:00.000 1 2 0 <-- single event, not added
---------------- only 1 event before switching to cellular ----------------
4 1 2021-11-23 11:11:00.000 0 2 0
------------------------------ more than 5 min -----------------------------
5 1 2021-11-23 11:22:00.000 0 2 0 <-- single event, not added
---------------------------------- new user --------------------------------
6 2 2021-11-23 11:40:00.000 1 0 0
7 2 2021-11-23 11:41:00.000 1 1 0
8 2 2021-11-23 11:42:00.000 1 2 0
--------------------------- switching to cellular --------------------------
9 2 2021-11-23 11:43:00.000 0 2 0
10 2 2021-11-23 11:44:00.000 0 2 1
11 2 2021-11-23 11:48:00.000 0 2 5
I've written the following code to mark each 5-minute session with a unique id:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['session_grp'] = (
    df.groupby('user_id')
    .apply(lambda x: x.groupby(pd.Grouper(key='timestamp', freq='5min', origin='start')).ngroup())
    .reset_index(drop=True)
    .values
)
And it seems to be working fine:
user_id timestamp wifi session_grp
0 1 2021-11-23 11:00:00 1 0
1 1 2021-11-23 11:01:00 1 0
2 1 2021-11-23 11:02:00 1 0
3 1 2021-11-23 11:10:00 1 2
4 1 2021-11-23 11:11:00 0 2
5 1 2021-11-23 11:22:00 0 4
6 2 2021-11-23 11:40:00 1 0
7 2 2021-11-23 11:41:00 1 0
8 2 2021-11-23 11:42:00 1 0
9 2 2021-11-23 11:43:00 0 0
10 2 2021-11-23 11:44:00 0 0
11 2 2021-11-23 11:48:00 0 1
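(A side note on the approach above: `pd.Grouper` with `freq='5min'` cuts the timeline into fixed bins, which is not quite the same as the "less than 5 minutes apart" rule — two events only 2 minutes apart can still land in different bins. A minimal sketch with two hypothetical timestamps:)

```python
import pandas as pd

# Hypothetical two-row example: the events are only 2 minutes apart,
# so by the "less than 5 min apart" rule they form one session,
# but fixed 5-minute bins split them.
ts = pd.Series(pd.to_datetime(["2021-11-23 11:04:00", "2021-11-23 11:06:00"]))

# Fixed bins: floor each timestamp to its 5-minute bucket
bins = ts.dt.floor("5min")
print(bins.nunique())  # 2 -> the pair is split across two bins

# Gap-based sessions: start a new session only when the gap exceeds 5 minutes
sessions = ts.diff().gt(pd.Timedelta("5min")).cumsum()
print(sessions.nunique())  # 1 -> the pair stays in one session
```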
But that is it, I'm stuck. Any help would be appreciated.
CodePudding user response:
# convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# start a new session group whenever the gap to the previous row exceeds 5 minutes
# (total_seconds is safer than dt.seconds, which ignores the days component)
df['grp'] = df.groupby('user_id')['timestamp'].diff().dt.total_seconds().gt(300).cumsum()
# calc the within-group time difference and fill missing values with 0
df['diff'] = df.groupby(['user_id', 'wifi', 'grp'])['timestamp'].diff().fillna(pd.Timedelta(0))
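To see what this grouping step produces, here it is run in isolation on user 1's timestamps from the sample (a sketch using `dt.total_seconds()`, which is equivalent on this data):

```python
import pandas as pd

# User 1's timestamps from the question's sample data
ts = pd.Series(pd.to_datetime([
    "2021-11-23 11:00:00", "2021-11-23 11:01:00", "2021-11-23 11:02:00",
    "2021-11-23 11:10:00", "2021-11-23 11:11:00", "2021-11-23 11:22:00",
]))

gaps = ts.diff()                             # NaT, 1min, 1min, 8min, 1min, 11min
new_session = gaps.dt.total_seconds().gt(300)  # True where the gap exceeds 5 minutes
grp = new_session.cumsum()                   # running count -> session id
print(grp.tolist())  # [0, 0, 0, 1, 1, 2]
```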
# use loc to filter the frame and assign the diff value for each slice (i.e., wifi and cell)
df.loc[df['wifi'] == 1, 'wifi_time_spent'] = df.loc[df['wifi'] == 1, 'diff']
df.loc[df['wifi'] == 0, 'cell_time_spent'] = df.loc[df['wifi'] == 0, 'diff']
# drop columns not needed and fill the missing values with 0
df = df.drop(columns=['grp', 'diff']).fillna(pd.Timedelta(0))
# groupby one more time and calculate the cumsum for each column
df['wifi_time_spent'] = df.groupby('user_id')['wifi_time_spent'].cumsum()
df['cell_time_spent'] = df.groupby('user_id')['cell_time_spent'].cumsum()
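Putting the steps together, here is a self-contained run on the question's sample data (same column names; the final values match the output shown below):

```python
import pandas as pd

# The question's sample data
df = pd.DataFrame({
    "user_id": [1] * 6 + [2] * 6,
    "timestamp": pd.to_datetime([
        "2021-11-23 11:00", "2021-11-23 11:01", "2021-11-23 11:02",
        "2021-11-23 11:10", "2021-11-23 11:11", "2021-11-23 11:22",
        "2021-11-23 11:40", "2021-11-23 11:41", "2021-11-23 11:42",
        "2021-11-23 11:43", "2021-11-23 11:44", "2021-11-23 11:48",
    ]),
    "wifi": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})

# new session whenever the gap to the previous row (per user) exceeds 5 minutes
df["grp"] = df.groupby("user_id")["timestamp"].diff().dt.total_seconds().gt(300).cumsum()

# within each (user, connection type, session), time since the previous row
df["diff"] = df.groupby(["user_id", "wifi", "grp"])["timestamp"].diff().fillna(pd.Timedelta(0))

# route each per-row delta to the matching column
df.loc[df["wifi"] == 1, "wifi_time_spent"] = df.loc[df["wifi"] == 1, "diff"]
df.loc[df["wifi"] == 0, "cell_time_spent"] = df.loc[df["wifi"] == 0, "diff"]

df = df.drop(columns=["grp", "diff"]).fillna(pd.Timedelta(0))

# running totals per user
df["wifi_time_spent"] = df.groupby("user_id")["wifi_time_spent"].cumsum()
df["cell_time_spent"] = df.groupby("user_id")["cell_time_spent"].cumsum()

print(df.tail(1)[["wifi_time_spent", "cell_time_spent"]])
# final row for user 2: 2 minutes of wifi, 5 minutes of cellular
```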
Out
user_id timestamp wifi wifi_time_spent cell_time_spent
0 1 2021-11-23 11:00:00 1 0 days 00:00:00 0 days 00:00:00
1 1 2021-11-23 11:01:00 1 0 days 00:01:00 0 days 00:00:00
2 1 2021-11-23 11:02:00 1 0 days 00:02:00 0 days 00:00:00
3 1 2021-11-23 11:10:00 1 0 days 00:02:00 0 days 00:00:00
4 1 2021-11-23 11:11:00 0 0 days 00:02:00 0 days 00:00:00
5 1 2021-11-23 11:22:00 0 0 days 00:02:00 0 days 00:00:00
6 2 2021-11-23 11:40:00 1 0 days 00:00:00 0 days 00:00:00
7 2 2021-11-23 11:41:00 1 0 days 00:01:00 0 days 00:00:00
8 2 2021-11-23 11:42:00 1 0 days 00:02:00 0 days 00:00:00
9 2 2021-11-23 11:43:00 0 0 days 00:02:00 0 days 00:00:00
10 2 2021-11-23 11:44:00 0 0 days 00:02:00 0 days 00:01:00
11 2 2021-11-23 11:48:00 0 0 days 00:02:00 0 days 00:05:00
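If plain minute counts are preferred over Timedelta values (as in the desired-result table), the columns can be converted afterwards; a small sketch on two example values:

```python
import pandas as pd

# Convert a Timedelta column into minutes as plain numbers
s = pd.Series(pd.to_timedelta(["0 days 00:02:00", "0 days 00:05:00"]))
minutes = s.dt.total_seconds().div(60)
print(minutes.tolist())  # [2.0, 5.0]
```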