I have a dataframe with a list of events, a column for an indicator for a criterion, and a column for a timestamp.
For each event, if the indicator is true, I want to see if the event lasted more than one period, and for how long.
An example of the expected output is below. In the duration column, A is true for only one time period at first, so that row is coded as 1. A is then False for the next period, so that row is coded as 0. Next, A is true for 2 consecutive time periods, so the duration on the first of those rows is 2; the following row can be coded as 0, since I am only interested in the first entry of each run, and so on.
   id target                time  duration
0   A   True 2023-01-22 11:00:00         1
3   A  False 2023-01-22 11:05:00         0
6   A   True 2023-01-22 11:10:00         2
9   A   True 2023-01-22 11:15:00         0
12  A  False 2023-01-22 11:20:00         0
But I have no idea how to do this.
A sample dataframe is included below:
import pandas as pd
time_test = pd.DataFrame({'id':[
'A','B','C','A','B','C',
'A','B','C','A','B','C',
'A','B','C','A','B','C'],
'target':[
'True','True','True','False','True','True',
'True','False','True','True','True','True',
'False','True','False','True','False','True'],
'time':[
'11:00','11:00','11:00','11:05','11:05','11:05',
'11:10','11:10','11:10','11:15','11:15','11:15',
'11:20','11:20','11:20','11:25','11:25','11:25']})
time_test = time_test.sort_values(['id', 'time'])
time_test['time'] = pd.to_datetime(time_test['time'])
time_test
EDIT: I need to provide some clarification about the expected output
Let's take group B as an example. An event occurs for B at 11:00, indicated by the "True" under target. At 11:05, the event is still occurring, so duration should be 2 for the row "1 B True 2023-01-22 11:00:00". I am not interested in the row that follows, so it can be coded as 0. In a sense, 0 represents both "already accounted for" and the absence of an event.
At 11:10 that event is not occurring so the summation would re-set.
At 11:15 another event is occurring, and at 11:20 that event is still going, so the value for the first row should be 2.
In the end, the values for B should be 2,0,0,2,0,0.
I can see why this method would be confusing, but I hope my explanation makes sense. My data is in 5-minute chunks, so I figured I could just count the number of chunks to see how long an event lasted, instead of using a start and end time to calculate the elapsed time (but maybe that would be easier?).
CodePudding user response:
Annotated code
# Convert the target column to boolean
mask = time_test['target'].eq('True')
# Create subgroups to identify blocks of consecutive True's
time_test['subgrps'] = (~mask).cumsum()[mask]
# Group the target mask by id and subgrps
g = mask.groupby([time_test['id'], time_test['subgrps']])
# Create a boolean mask to identify dupes per id and subgrps
dupes = time_test.duplicated(subset=['id', 'subgrps'])
# Sum the True value per group and mask the duplicates
time_test['duration'] = g.transform('sum').mask(dupes).fillna(0)
Result
   id target                time  subgrps  duration
0   A   True 2023-01-22 11:00:00      0.0       1.0
3   A  False 2023-01-22 11:05:00      NaN       0.0
6   A   True 2023-01-22 11:10:00      1.0       2.0
9   A   True 2023-01-22 11:15:00      1.0       0.0
12  A  False 2023-01-22 11:20:00      NaN       0.0
15  A   True 2023-01-22 11:25:00      2.0       1.0
1   B   True 2023-01-22 11:00:00      2.0       2.0
4   B   True 2023-01-22 11:05:00      2.0       0.0
7   B  False 2023-01-22 11:10:00      NaN       0.0
10  B   True 2023-01-22 11:15:00      3.0       2.0
13  B   True 2023-01-22 11:20:00      3.0       0.0
16  B  False 2023-01-22 11:25:00      NaN       0.0
2   C   True 2023-01-22 11:00:00      4.0       4.0
5   C   True 2023-01-22 11:05:00      4.0       0.0
8   C   True 2023-01-22 11:10:00      4.0       0.0
11  C   True 2023-01-22 11:15:00      4.0       0.0
14  C  False 2023-01-22 11:20:00      NaN       0.0
17  C   True 2023-01-22 11:25:00      5.0       1.0
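A variant sketch
If integer durations are preferred and the helper column is not wanted, the same run-counting idea can be wrapped in a function. This is only a sketch under a few assumptions: the function name first_row_durations and its parameters are made up for illustration, the sample frame is rebuilt here with real booleans instead of the strings 'True'/'False' (the function accepts either), and the break counter is computed per id rather than over the whole frame.

```python
import pandas as pd


def first_row_durations(df, id_col="id", flag_col="target", time_col="time"):
    """Return a sorted copy of df with an integer 'duration' column.

    The first row of each run of consecutive True flags (per id) gets the
    run length in periods; every other row gets 0.
    """
    out = df.sort_values([id_col, time_col]).copy()
    # Accept real booleans as well as the strings 'True' / 'False'
    flag = out[flag_col].astype(str).eq("True")
    flag_int = flag.astype(int)
    # Per id, count the False rows seen so far: each False starts a new block,
    # so all rows of one run of True values share a block number
    breaks = (1 - flag_int).groupby(out[id_col]).cumsum()
    keys = [out[id_col], breaks]
    # Number of True rows in each block, i.e. the run length
    run_len = flag_int.groupby(keys).transform("sum")
    # Position of each True row within its block (1 for the first True)
    pos = flag_int.groupby(keys).cumsum()
    is_start = flag & pos.eq(1)
    # Keep the run length on the first row of the run, 0 everywhere else
    out["duration"] = run_len.where(is_start, 0)
    return out


# Sample data mirroring the question (hypothetical reconstruction with booleans)
times = pd.date_range("2023-01-22 11:00", periods=6, freq="5min")
flags = {
    "A": [True, False, True, True, False, True],
    "B": [True, True, False, True, True, False],
    "C": [True, True, True, True, False, True],
}
sample = pd.DataFrame(
    [(key, f, t) for key, fs in flags.items() for f, t in zip(fs, times)],
    columns=["id", "target", "time"],
)
result = first_row_durations(sample)
print(result)
```

Since the data comes in 5-minute chunks, multiplying the duration column by pd.Timedelta(minutes=5) should give the elapsed time, provided no periods are missing from the data.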