I have an application log that I am pulling into a pandas DataFrame, parsing various pieces of each event into separate columns. It is similar to this:
import pandas as pd

data1 = {'timestamp': ['01-01-2021 12:00:00','01-01-2021 12:01:00','01-01-2021 12:02:00','01-01-2021 12:03:00','01-01-2021 12:04:00','01-01-2021 12:05:00','01-01-2021 12:06:00','01-01-2021 12:07:00','01-01-2021 12:08:00','01-01-2021 12:09:00','01-01-2021 12:10:00','01-01-2021 12:11:00','01-01-2021 12:12:00','01-01-2021 12:13:00','01-01-2021 12:14:00'],
'event': ['start','x','y','start','z','end','x','end','start','x','end','start','q','end','start'],
'account': ['bob','bob','bob','jane','bob','bob','jane','jane','todd','todd','todd','bob','bob','bob','ned'] }
df1 = pd.DataFrame(data1)
print(df1)
timestamp event account
0 01-01-2021 12:00:00 start bob
1 01-01-2021 12:01:00 x bob
2 01-01-2021 12:02:00 y bob
3 01-01-2021 12:03:00 start jane
4 01-01-2021 12:04:00 z bob
5 01-01-2021 12:05:00 end bob
6 01-01-2021 12:06:00 x jane
7 01-01-2021 12:07:00 end jane
8 01-01-2021 12:08:00 start todd
9 01-01-2021 12:09:00 x todd
10 01-01-2021 12:10:00 end todd
11 01-01-2021 12:11:00 start bob
12 01-01-2021 12:12:00 q bob
13 01-01-2021 12:13:00 end bob
14 01-01-2021 12:14:00 start ned
Pretty straightforward log. The timestamps are strings but can easily be converted to datetime objects if needed. The log lists various actions performed by the users, but what I am interested in is generating a report of the account sessions by user. Something like this:
account start end
0 bob 01-01-2021 12:00:00 01-01-2021 12:05:00
1 jane 01-01-2021 12:03:00 01-01-2021 12:07:00
2 todd 01-01-2021 12:08:00 01-01-2021 12:10:00
3 bob 01-01-2021 12:11:00 01-01-2021 12:13:00
4 ned 01-01-2021 12:14:00 NaN
It's easy enough to group by start and end times; the part I can't figure out is how to do this when a user has multiple sessions in the time period I am looking at. In the sample log data above, Bob has 2 sessions, but he could in theory have opened 100 sessions depending on the date range of the data I'm looking at. The application in question restricts users to 1 session at a time, so I shouldn't see the same account with 2 sessions open at the same time.
How do I group the session start/end times into two columns like this, by account, when an account could have multiple sessions in the data?
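For reference, the conversion I mentioned would look roughly like this; it assumes month-day-year ordering, which the sample data doesn't disambiguate, and the outputs below keep the original strings:
timestamps = pd.to_datetime(df1['timestamp'], format='%m-%d-%Y %H:%M:%S')  # use '%d-%m-%Y %H:%M:%S' instead if the log is day-first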
CodePudding user response:
Try:
(df1.assign(idx=df1.event.eq('start').groupby(df1.account).cumsum()) # number each account's sessions by counting its `start` events
.loc[lambda x: x['event'].isin(['start','end'])] # keep only the `start` and `end` rows
.set_index(['idx','account','event']) # build a (idx, account, event) MultiIndex
['timestamp'].unstack() # unstack `event` into `start`/`end` columns
.reset_index(level=1).reset_index(drop=True) # move `account` back to a column and drop the helper `idx` level
)
Output:
event account end start
0 bob 01-01-2021 12:05:00 01-01-2021 12:00:00
1 jane 01-01-2021 12:07:00 01-01-2021 12:03:00
2 ned NaN 01-01-2021 12:14:00
3 todd 01-01-2021 12:10:00 01-01-2021 12:08:00
4 bob 01-01-2021 12:13:00 01-01-2021 12:11:00
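If you want the columns in the order from the question, a small follow-up sketch (assuming the result above is saved to a variable, here called sessions):
sessions = sessions[['account', 'start', 'end']]  # reorder to account/start/end; unstack sorts the new columns alphabetically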
CodePudding user response:
You could identify the starts and use pivot_table:
(df1.assign(idx=df1['event'].eq('start').groupby(df1['account']).cumsum())  # number each account's sessions
.pivot_table(index=['account', 'idx'], columns='event', values='timestamp', aggfunc='first')  # one row per (account, session)
[['start', 'end']]  # keep only the start/end columns; the other events become columns too and are dropped here
)
Output:
event start end
account idx
bob 1 01-01-2021 12:00:00 01-01-2021 12:05:00
2 01-01-2021 12:11:00 01-01-2021 12:13:00
jane 1 01-01-2021 12:03:00 01-01-2021 12:07:00
ned 1 01-01-2021 12:14:00 NaN
todd 1 01-01-2021 12:08:00 01-01-2021 12:10:00
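To turn this into the flat report from the question, reset the index (a sketch; out is just a placeholder name for the pivot result above):
out = out.reset_index().drop(columns='idx')        # move account back to a regular column; the per-account session counter isn't needed in the report
out = out.sort_values('start', ignore_index=True)  # chronological order; convert 'start' to datetime first if the log spans more than one day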