I recently started working with time series data, and I want to find the start and end times of each run of consecutive values in a column that are at or above a defined threshold of 150.
Datetime Value
0 11/30/2022 0:00 100
1 11/30/2022 0:01 110
2 11/30/2022 0:02 105
3 11/30/2022 0:03 105
4 11/30/2022 0:04 155
5 11/30/2022 0:05 160
6 11/30/2022 0:06 160
7 11/30/2022 0:07 160
8 11/30/2022 0:08 160
9 11/30/2022 0:09 165
10 11/30/2022 0:10 165
11 11/30/2022 0:11 160
12 11/30/2022 0:12 160
13 11/30/2022 0:13 150
14 11/30/2022 0:14 120
15 11/30/2022 0:15 110
16 11/30/2022 0:16 115
17 11/30/2022 0:17 115
18 11/30/2022 0:18 130
19 11/30/2022 0:19 145
20 11/30/2022 0:20 150
21 11/30/2022 0:21 155
22 11/30/2022 0:22 155
23 11/30/2022 0:23 155
24 11/30/2022 0:24 155
25 11/30/2022 0:25 155
26 11/30/2022 0:26 140
27 11/30/2022 0:27 130
28 11/30/2022 0:28 120
I want to get an output in the form of a dataframe having multiple start and end times along with the duration in seconds:
Start_Time End_Time Duration
0 2022-11-30 00:04:00 2022-11-30 00:13:00 540.0
1 2022-11-30 00:20:00 2022-11-30 00:25:00 300.0
I can compute the duration using df['Duration'] = (df['End_Time']-df['Start_Time']).dt.total_seconds()
but I cannot work out how to get those start and end times. Can someone please help me out with this?
CodePudding user response:
First, create your dataframe (the readings are one minute apart, so the frequency is "1min"):
import pandas as pd
start = pd.to_datetime('30-11-2022, 0:0:0', format="%d-%m-%Y, %H:%M:%S")
df = pd.DataFrame({
    "StartTime": pd.date_range(start=start, freq="1min", periods=29),
    "Value": [100,110,105,105,155,160,160,160,160,165,165,160,160,150,120,110,115,115,130,145,150,155,155,155,155,155,140,130,120]})
Next, filter to only the values you want. Your expected output includes the rows equal to 150, so use >= rather than >:
df = df.loc[df.Value >= 150].copy()
Then combine consecutive rows into windows by shifting the time column and checking whether each neighbouring row is exactly one sampling step away:
step = pd.Timedelta(minutes=1)
# a row ends a window when the next remaining row is not one step later
merge_endtimes = df.StartTime[df.StartTime.shift(-1) != df.StartTime + step].rename("EndTime").reset_index(drop=True)
# a row starts a window when the previous remaining row is not one step earlier
merge_starttimes = df.StartTime[df.StartTime.shift(1) != df.StartTime - step].reset_index(drop=True)
merged_df = pd.concat([merge_starttimes, merge_endtimes], axis=1)
merged_df['Duration'] = (merged_df['EndTime'] - merged_df['StartTime']).dt.total_seconds()
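To make the boundary check concrete, here is a minimal sketch on a handful of made-up timestamps. It assumes a fixed one-minute sampling step; the names `times`, `step`, `is_start` and `is_end` are illustrative only. A row counts as the start of a window when the previous remaining row is not exactly one step earlier, and as the end of a window when the next remaining row is not exactly one step later.

```python
import pandas as pd

# Hypothetical timestamps left after filtering:
# minutes 4, 5, 6 form one window and minutes 20, 21 form another.
times = pd.Series(pd.to_datetime([
    "2022-11-30 00:04", "2022-11-30 00:05", "2022-11-30 00:06",
    "2022-11-30 00:20", "2022-11-30 00:21",
]))
step = pd.Timedelta(minutes=1)  # assumed sampling interval

# shift(1) looks at the previous row, shift(-1) at the next one.
# The first and last rows compare against NaT, which never equals a timestamp.
is_start = times.shift(1) != times - step
is_end = times.shift(-1) != times + step

print(pd.DataFrame({"time": times, "is_start": is_start, "is_end": is_end}))
```

Note that this kind of check relies on the sampling interval being regular; with irregular timestamps the comparison against a fixed step would split windows unexpectedly.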
CodePudding user response:
Code
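# True where the value is at or above the threshold
s = df['Value'].ge(150)
# label each run of identical True/False values with its own number
grouper = s.ne(s.shift(1)).cumsum()
# keep only the rows above the threshold, then take the first and last time of each run
df[s].groupby(grouper)['Datetime'].agg(['min', 'max'])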
output:
min max
Value
2 2022-11-30 00:04:00 2022-11-30 00:13:00
4 2022-11-30 00:20:00 2022-11-30 00:25:00
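To see why the group labels come out as 2 and 4, here is a minimal sketch of the run-labelling step on a short made-up series (the values are illustrative, not taken from the question). Each time the boolean mask flips, `s.ne(s.shift(1))` is True, and the cumulative sum turns those flips into a run number.

```python
import pandas as pd

# Hypothetical toy values, chosen only to show the run-labelling step.
values = pd.Series([100, 155, 160, 120, 150, 150])

s = values.ge(150)                   # True where the value is at or above 150
grouper = s.ne(s.shift(1)).cumsum()  # increases by one every time s flips

print(pd.DataFrame({"value": values, "above": s, "run": grouper}))
```

The runs below the threshold get labels too (1 and 3 here), which is why the rows are filtered with `df[s]` before grouping.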
change index & columns of output
(df[s].groupby(grouper)['Datetime'].agg(['min', 'max'])
.set_axis(['start_time', 'end_time'], axis=1)
.reset_index(drop=True))
result:
start_time end_time
0 2022-11-30 00:04:00 2022-11-30 00:13:00
1 2022-11-30 00:20:00 2022-11-30 00:25:00
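Putting the pieces together, the following is one possible end-to-end sketch that also adds the Duration column asked for in the question. It assumes the Datetime column is already parsed as datetimes with one reading per minute; the helper name `threshold_runs` is hypothetical, and named aggregation is used here simply as an alternative to renaming the columns afterwards.

```python
import pandas as pd

# Rebuild the question's data: one reading per minute starting at midnight.
df = pd.DataFrame({
    "Datetime": pd.date_range("2022-11-30 00:00", freq="1min", periods=29),
    "Value": [100, 110, 105, 105, 155, 160, 160, 160, 160, 165, 165, 160,
              160, 150, 120, 110, 115, 115, 130, 145, 150, 155, 155, 155,
              155, 155, 140, 130, 120],
})


def threshold_runs(frame, threshold=150):
    """Return one row per consecutive run of Value >= threshold (hypothetical helper)."""
    s = frame["Value"].ge(threshold)
    grouper = s.ne(s.shift(1)).cumsum()
    runs = (
        frame[s]
        .groupby(grouper[s])["Datetime"]
        .agg(Start_Time="min", End_Time="max")  # named aggregation sets the column names
        .reset_index(drop=True)
    )
    # Duration in seconds, computed the same way as in the question.
    runs["Duration"] = (runs["End_Time"] - runs["Start_Time"]).dt.total_seconds()
    return runs


result = threshold_runs(df)
print(result)
```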