Create data
actions = ['Start','Action1','Action2','Pause','Actoin3','Resume','Action1','Finish','Start','Action1','Finish']
start_date = datetime.datetime.strptime('14/10/21 09:00:00', '%d/%m/%y %H:%M:%S')
date_list = [start_date datetime.timedelta(seconds=x) for x in range(0,11)]
values = [1,1,2,1,2,1,5,1,1,1,1]
df = pd.DataFrame({'ActionType': actions,
'Timestamp': date_list,
'Value': values})
ActionType | Timestamp | Value |
---|---|---|
Start | 2021-10-14 09:00:00 | 1 |
Action1 | 2021-10-14 09:00:01 | 1 |
Action2 | 2021-10-14 09:00:02 | 2 |
Pause | 2021-10-14 09:00:03 | 1 |
Action3 | 2021-10-14 09:00:04 | 2 |
Restart | 2021-10-14 09:00:05 | 1 |
Action1 | 2021-10-14 09:00:06 | 5 |
Finish | 2021-10-14 09:00:07 | 1 |
Start | 2021-10-14 09:00:08 | 1 |
Action1 | 2021-10-14 09:00:09 | 1 |
Finish | 2021-10-14 09:00:010 | 1 |
See how there are two "sessions" taking place. I want to mark each session in a new column.
- How do I get the rows in-between Start and Finish rows? (assume sorted)
- Similarly, how can I filter-out pauses in the session? For example, to calculate RealTimeElapsed column (or simply make a DuringPause boolean column)
The output should look like this:
output = pd.DataFrame({'Session:': [0,0,0,0,0,0,0,0,1,1,1],
'ActionType': actions,
'Timestamp': date_list,
'RealTimeElapsed': [0,1,2,3,3,3,4,5,0,1,2],
'Value': values
})
Session | ActionType | Timestamp | RealTimeElapsed | Value |
---|---|---|---|---|
0 | Start | 2021-10-14 09:00:00 | 0 | 1 |
0 | Action1 | 2021-10-14 09:00:01 | 1 | 1 |
0 | Action2 | 2021-10-14 09:00:02 | 2 | 1 |
0 | Pause | 2021-10-14 09:00:03 | 3 | 1 |
0 | Action3 | 2021-10-14 09:00:04 | 3 | 1 |
0 | Resume | 2021-10-14 09:00:05 | 3 | 1 |
0 | Action1 | 2021-10-14 09:00:06 | 4 | 1 |
0 | Finish | 2021-10-14 09:00:07 | 5 | 1 |
1 | Start | 2021-10-14 09:00:08 | 0 | 1 |
1 | Action1 | 2021-10-14 09:00:09 | 1 | 1 |
1 | Finish | 2021-10-14 09:00:010 | 2 | 1 |
Already considered:
- Looping: this is a poor practice (my data is very large), but let me know if this is the only feasible solution.
- Shift: pandas has a shift function, but I only know how to use it for a fixed number of rows, there's no conditional things (e.g. start/finish/pause/resume) I can treat differently
- Groupby('Type').unstack() and take differences between times: I can't do this because I need to maintain the Value column
CodePudding user response:
My solution is:
df["Session"] = np.where(df["ActionType"]=="Start",1,0).cumsum()-1
df["RealTimeElapsed"] = df.ActionType.str.contains("Action1|Action2|Pause|Finish").factorize()[0]
df["RealTimeElapsed"] = df.groupby("Session")["RealTimeElapsed"].cumsum()
df = df[["Session","ActionType","Timestamp","RealTimeElapsed","Value"]]
df
returns
Session ActionType Timestamp RealTimeElapsed Value
0 0 Start 2021-10-14 09:00:00 0 1
1 0 Action1 2021-10-14 09:00:01 1 1
2 0 Action2 2021-10-14 09:00:02 2 2
3 0 Pause 2021-10-14 09:00:03 3 1
4 0 Actoin3 2021-10-14 09:00:04 3 2
5 0 Resume 2021-10-14 09:00:05 3 1
6 0 Action1 2021-10-14 09:00:06 4 5
7 0 Finish 2021-10-14 09:00:07 5 1
8 1 Start 2021-10-14 09:00:08 0 1
9 1 Action1 2021-10-14 09:00:09 1 1
10 1 Finish 2021-10-14 09:00:10 2 1