I am currently working with session logs and am interested in counting occurrences of specific events within custom time frames (the first 1, 5, and 10 minutes, and after 10 minutes). To simplify: the start of a session is defined as the time of the first occurrence of a relevant event. I have already filtered the sessions down to only the relevant events, and the dataframe looks similar to this.
Input
import pandas as pd
data_in = {'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'Timestamp': ['2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00', '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00', '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00', '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00', '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00'],
'event': ['match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match']
}
df_in = pd.DataFrame(data_in)
df_in
Desired Output:
data_out = {'SessionId': ['A', 'B', 'C'],
'#events_first_1_minute': [2, 1, 1],
'#events_first_5_minute': [4, 2, 1],
'#events_first_10_minute': [4, 2, 2],
'#events_after_10_minute': [5, 3, 2]
}
df_out = pd.DataFrame(data_out)
df_out
I already played around with groupby and pd.Grouper. I can get the total number of relevant events per session, but I don't see any option for custom time bins. Another idea was to drop the date part and work only with the time of day, but of course some sessions start on one day and end on the next (SessionId C).
Any help is appreciated!
CodePudding user response:
Using pandas.cut:
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
bins = ['1min', '5min', '10min']
bins2 = pd.to_timedelta(['0'] + bins + ['10000days'])
group = pd.cut(df_in.groupby('SessionId', group_keys=False)['Timestamp'].apply(lambda x: x - x.min()),
               bins=bins2, labels=bins + ['>' + bins[-1]]).fillna(bins[0])
(df_in
.groupby(['SessionId', group]).size()
.unstack(level=1)
.cumsum(axis=1)
)
output:
Timestamp 1min 5min 10min >10min
SessionId
A 2 4 4 5
B 1 2 2 3
C 1 1 2 2
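If the exact column names from the question are wanted, the same cut-and-cumsum idea can be finished with a rename at the end. A self-contained sketch (it uses `transform('min')` instead of `apply` to keep the original index, and `include_lowest=True` so the first event lands in the first bin without `fillna`):

```python
import pandas as pd

df_in = pd.DataFrame({
    'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Timestamp': pd.to_datetime([
        '2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00',
        '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00',
        '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00',
        '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00',
        '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00']),
})

# elapsed time since the first event of each session
elapsed = df_in['Timestamp'] - df_in.groupby('SessionId')['Timestamp'].transform('min')

# bin into the four windows; include_lowest puts the first event (0 elapsed)
# into the first bin, so no fillna is needed
labels = ['1min', '5min', '10min', '>10min']
binned = pd.cut(elapsed,
                bins=pd.to_timedelta(['0', '1min', '5min', '10min', '10000days']),
                labels=labels, include_lowest=True)

# count per session and bin, then accumulate across the bins
counts = (df_in.groupby(['SessionId', binned], observed=False)
          .size().unstack(level=1).cumsum(axis=1))

# map onto the column names from the question
counts.columns = ['#events_first_1_minute', '#events_first_5_minute',
                  '#events_first_10_minute', '#events_after_10_minute']
counts = counts.reset_index()
print(counts)
```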
CodePudding user response:
First convert your Timestamp column to datetime64 dtype, then group by SessionId and aggregate the data:
# Not mandatory if it's already the case
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
out = df.groupby('SessionId')['Timestamp'] \
        .agg(**{'<1m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=1)),
                '<5m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=5)),
                '<10m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=10)),
                '>10m': 'size'}).reset_index()
Output:
>>> out
SessionId <1m <5m <10m >10m
0 A 2 4 4 5
1 B 1 2 2 3
2 C 1 1 2 2
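Since the windows here are all fixed lengths, pd.Timedelta works in place of pd.DateOffset, and the boolean comparison can be summed directly on the Series. A sketch over the question's data (like the answer above, it assumes timestamps are sorted within each session so that `x.iloc[0]` is the session start):

```python
import pandas as pd

df_in = pd.DataFrame({
    'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Timestamp': pd.to_datetime([
        '2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00',
        '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00',
        '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00',
        '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00',
        '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00']),
})

# pd.Timedelta covers fixed-length windows just as well as pd.DateOffset,
# and (x < threshold).sum() keeps the comparison vectorised
out = df_in.groupby('SessionId')['Timestamp'].agg(
    **{'<1m': lambda x: (x < x.iloc[0] + pd.Timedelta(minutes=1)).sum(),
       '<5m': lambda x: (x < x.iloc[0] + pd.Timedelta(minutes=5)).sum(),
       '<10m': lambda x: (x < x.iloc[0] + pd.Timedelta(minutes=10)).sum(),
       '>10m': 'size'}).reset_index()
print(out)
```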
CodePudding user response:
Use:
#np is needed below for np.inf
import numpy as np

#convert values to datetimes
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
#get minutes by subtracting the minimal datetime per group
df_in['g'] = (df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
              .dt.total_seconds().div(60))
#binning to intervals
lab = ['events_first_1_minute','events_first_5_minute','events_first_10_minute',
'events_after_10_minute']
df_in['g'] = pd.cut(df_in['g'],
bins=[0, 1, 5, 10, np.inf],
labels=lab, include_lowest=True)
#count values with cumulative sums
df = (pd.crosstab(df_in['SessionId'], df_in['g'])
.cumsum(axis=1)
.rename(columns=str)
.reset_index()
.rename_axis(None, axis=1))
print (df)
SessionId events_first_1_minute events_first_5_minute \
0 A 2 4
1 B 1 2
2 C 1 1
events_first_10_minute events_after_10_minute
0 4 5
1 2 3
2 2 2
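As a sanity check, this pipeline can be compared against the desired df_out with pandas' testing helper. A sketch (the labels get the `#` prefix here so the frames line up, and `check_dtype=False` allows the count dtype to differ from the hand-written ints):

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

df_in = pd.DataFrame({
    'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Timestamp': pd.to_datetime([
        '2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00',
        '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00',
        '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00',
        '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00',
        '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00']),
})

# minutes elapsed since the first event of each session
minutes = (df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
           .dt.total_seconds().div(60))

lab = ['#events_first_1_minute', '#events_first_5_minute',
       '#events_first_10_minute', '#events_after_10_minute']
binned = pd.cut(minutes, bins=[0, 1, 5, 10, np.inf], labels=lab, include_lowest=True)

result = (pd.crosstab(df_in['SessionId'], binned)
          .cumsum(axis=1).rename(columns=str)
          .reset_index().rename_axis(None, axis=1))

expected = pd.DataFrame({'SessionId': ['A', 'B', 'C'],
                         '#events_first_1_minute': [2, 1, 1],
                         '#events_first_5_minute': [4, 2, 1],
                         '#events_first_10_minute': [4, 2, 2],
                         '#events_after_10_minute': [5, 3, 2]})

# raises if anything differs; counts may come back as a different int dtype
assert_frame_equal(result, expected, check_dtype=False)
print('matches desired output')
```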