I have an application log that I am pulling into a pandas DataFrame, parsing various pieces of each event into separate columns. It is similar to this:
import pandas as pd

data1 = {'timestamp': ['01-01-2021 12:00:00','01-01-2021 12:01:00','01-01-2021 12:02:00','01-01-2021 12:03:00','01-01-2021 12:04:00','01-01-2021 12:05:00','01-01-2021 12:06:00','01-01-2021 12:07:00','01-01-2021 12:08:00','01-01-2021 12:09:00','01-01-2021 12:10:00','01-01-2021 12:11:00','01-01-2021 12:12:00','01-01-2021 12:13:00','01-01-2021 12:14:00'],
'event': ['start','x','y','start','z','end','x','end','start','x','end','start','q','end','start'],
'account': ['bob','bob','bob','jane','bob','bob','jane','jane','todd','todd','todd','bob','bob','bob','ned'] }
df1 = pd.DataFrame(data1)
print(df1)
timestamp event account
0 01-01-2021 12:00:00 start bob
1 01-01-2021 12:01:00 x bob
2 01-01-2021 12:02:00 y bob
3 01-01-2021 12:03:00 start jane
4 01-01-2021 12:04:00 z bob
5 01-01-2021 12:05:00 end bob
6 01-01-2021 12:06:00 x jane
7 01-01-2021 12:07:00 end jane
8 01-01-2021 12:08:00 start todd
9 01-01-2021 12:09:00 x todd
10 01-01-2021 12:10:00 end todd
11 01-01-2021 12:11:00 start bob
12 01-01-2021 12:12:00 q bob
13 01-01-2021 12:13:00 end bob
14 01-01-2021 12:14:00 start ned
Pretty straightforward log. The timestamps are strings but can easily be converted to datetime objects if needed. The log lists various actions performed by the users, but what I am interested in is generating a report of the account sessions by user. Something like this:
account start end
0 bob 01-01-2021 12:00:00 01-01-2021 12:05:00
1 jane 01-01-2021 12:03:00 01-01-2021 12:07:00
2 todd 01-01-2021 12:08:00 01-01-2021 12:10:00
3 bob 01-01-2021 12:11:00 01-01-2021 12:13:00
4 ned 01-01-2021 12:14:00 NaN
It's easy enough to group by start and end times; the part I can't figure out is how to do this when a user has multiple sessions in the time period I am looking at. In the sample log data above, Bob has 2 sessions, but he could in theory have opened 100 sessions depending on the date range of the data I'm looking at. The application in question restricts users to 1 session at a time, so I shouldn't see the same account with 2 sessions open at the same time.
How do I group the session start/end times into two columns like this, by account, when an account could have multiple sessions in the data?
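For reference, the conversion I mentioned would look roughly like this; it assumes month-day-year ordering, which the sample data doesn't disambiguate, and the outputs below keep the original strings:
timestamps = pd.to_datetime(df1['timestamp'], format='%m-%d-%Y %H:%M:%S')  # use '%d-%m-%Y %H:%M:%S' instead if the log is day-first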
CodePudding user response:
Try:
(df1.assign(idx=df1.event.eq('start').groupby(df1.account).cumsum()) # number each account's sessions by counting its `start` events
.loc[lambda x: x['event'].isin(['start','end'])] # keep only the `start` and `end` rows
.set_index(['idx','account','event']) # build a (idx, account, event) MultiIndex
['timestamp'].unstack() # unstack `event` into `start`/`end` columns
.reset_index(level=1).reset_index(drop=True) # move `account` back to a column and drop the helper `idx` level
)
Output:
event account end start
0 bob 01-01-2021 12:05:00 01-01-2021 12:00:00
1 jane 01-01-2021 12:07:00 01-01-2021 12:03:00
2 ned NaN 01-01-2021 12:14:00
3 todd 01-01-2021 12:10:00 01-01-2021 12:08:00
4 bob 01-01-2021 12:13:00 01-01-2021 12:11:00
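If you want the columns in the order from the question, a small follow-up sketch (assuming the result above is saved to a variable, here called sessions):
sessions = sessions[['account', 'start', 'end']]  # reorder to account/start/end; unstack sorts the new columns alphabetically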
CodePudding user response:
You could identify the starts and use pivot_table:
(df1.assign(idx=df1['event'].eq('start').groupby(df1['account']).cumsum())  # number each account's sessions
.pivot_table(index=['account', 'idx'], columns='event', values='timestamp', aggfunc='first')  # one row per (account, session)
[['start', 'end']]  # keep only the start/end columns; the other events become columns too and are dropped here
)
Output:
event start end
account idx
bob 1 01-01-2021 12:00:00 01-01-2021 12:05:00
2 01-01-2021 12:11:00 01-01-2021 12:13:00
jane 1 01-01-2021 12:03:00 01-01-2021 12:07:00
ned 1 01-01-2021 12:14:00 NaN
todd 1 01-01-2021 12:08:00 01-01-2021 12:10:00
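To turn this into the flat report from the question, reset the index (a sketch; out is just a placeholder name for the pivot result above):
out = out.reset_index().drop(columns='idx')        # move account back to a regular column; the per-account session counter isn't needed in the report
out = out.sort_values('start', ignore_index=True)  # chronological order; convert 'start' to datetime first if the log spans more than one day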