Pandas: Find start and end times of timeseries data when value is above threshold in column

I recently started working with timeseries data, and I want to find the start and end times of the runs where the values in a column reach or exceed a defined threshold of 150.

    Datetime    Value
0   11/30/2022 0:00 100
1   11/30/2022 0:01 110
2   11/30/2022 0:02 105
3   11/30/2022 0:03 105
4   11/30/2022 0:04 155
5   11/30/2022 0:05 160
6   11/30/2022 0:06 160
7   11/30/2022 0:07 160
8   11/30/2022 0:08 160
9   11/30/2022 0:09 165
10  11/30/2022 0:10 165
11  11/30/2022 0:11 160
12  11/30/2022 0:12 160
13  11/30/2022 0:13 150
14  11/30/2022 0:14 120
15  11/30/2022 0:15 110
16  11/30/2022 0:16 115
17  11/30/2022 0:17 115
18  11/30/2022 0:18 130
19  11/30/2022 0:19 145
20  11/30/2022 0:20 150
21  11/30/2022 0:21 155
22  11/30/2022 0:22 155
23  11/30/2022 0:23 155
24  11/30/2022 0:24 155
25  11/30/2022 0:25 155
26  11/30/2022 0:26 140
27  11/30/2022 0:27 130
28  11/30/2022 0:28 120

I want the output as a dataframe with one row per run, giving its start time, end time, and duration in seconds:

    Start_Time          End_Time              Duration
0   2022-11-30 00:04:00 2022-11-30 00:13:00   540.0
1   2022-11-30 00:20:00 2022-11-30 00:25:00   300.0

I can compute the duration using df['Duration'] = (df['End_Time'] - df['Start_Time']).dt.total_seconds(), but I cannot work out how to get those start and end times. Can someone please help me out with this?

CodePudding user response:

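First, create your dataframe (the sample data has one reading per minute):

import pandas as pd

start = pd.to_datetime('30-11-2022, 0:0:0', format="%d-%m-%Y, %H:%M:%S")
step = pd.Timedelta(minutes=1)  # sampling interval of the data
df = pd.DataFrame({
    "StartTime": pd.date_range(start=start, freq="1min", periods=29),
    "Value": [100,110,105,105,155,160,160,160,160,165,165,160,160,150,120,110,115,115,130,145,150,155,155,155,155,155,140,130,120]})
# Treat each reading as a window that lasts until the next expected reading
df["EndTime"] = df["StartTime"] + step

Next, filter to only the rows you want. Your expected output includes the rows equal to 150, so the comparison is >= rather than >:

df = df.loc[df.Value >= 150].copy()

Then combine the consecutive windows by shifting the start and end columns and checking whether they line up with each other:

# A window ends a run when the next remaining window does not start where it ends
merge_endtimes = df.EndTime[df.StartTime.shift(-1) != df.EndTime].reset_index(drop=True)
# A window starts a run when the previous remaining window does not end where it starts
merge_starttimes = df.StartTime[df.EndTime.shift(1) != df.StartTime].reset_index(drop=True)
merged_df = pd.concat([merge_starttimes, merge_endtimes], axis=1)
# Step back one interval so EndTime is the timestamp of the last reading in the run
merged_df['EndTime'] = merged_df['EndTime'] - step
merged_df['Duration'] = (merged_df['EndTime'] - merged_df['StartTime']).dt.total_seconds()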
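
The idea behind the shift comparison can also be seen on the timestamps alone. The following is only a minimal sketch on made-up timestamps (not your full data), assuming a fixed one-minute sampling interval: once the rows below the threshold are dropped, a new run starts wherever the gap to the previous remaining timestamp is not exactly one interval.

```python
import pandas as pd

# Hypothetical timestamps left over after filtering; there is a gap after 00:06.
times = pd.Series(pd.to_datetime([
    "2022-11-30 00:04", "2022-11-30 00:05", "2022-11-30 00:06",
    "2022-11-30 00:20", "2022-11-30 00:21",
]))
step = pd.Timedelta(minutes=1)  # assumed sampling interval

# True on the first row (its diff is NaT) and on every row that follows a gap.
new_run = times.diff().ne(step)
# Running count of run starts gives every row the label of its run.
run_id = new_run.cumsum()

print(pd.DataFrame({"time": times, "new_run": new_run, "run_id": run_id}))
```

Grouping by `run_id` and taking the first and last timestamp of each group would then give the start and end of each run.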

CodePudding user response:

Code

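import pandas as pd

# Assumes df holds the question's data; make sure Datetime is a datetime column
df['Datetime'] = pd.to_datetime(df['Datetime'])
s = df['Value'].ge(150)  # True where the value is at or above the threshold
grouper = s.ne(s.shift(1)).cumsum()  # new label every time the mask flips
df[s].groupby(grouper)['Datetime'].agg(['min', 'max'])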

output:

        min                 max
Value       
2   2022-11-30 00:04:00 2022-11-30 00:13:00
4   2022-11-30 00:20:00 2022-11-30 00:25:00
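
To see why the index shows 2 and 4, it may help to look at the intermediate grouper on a small series. This is just an illustrative sketch with made-up values (not the question's data): `s.ne(s.shift())` flags every position where the mask flips, and `cumsum()` turns those flags into a label shared by consecutive equal values.

```python
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the run labelling.
values = pd.Series([100, 155, 160, 120, 150])

s = values.ge(150)                  # mask: at or above the threshold
grouper = s.ne(s.shift()).cumsum()  # increments whenever the mask flips

print(pd.DataFrame({"value": values, "mask": s, "run": grouper}))
```

Runs below the threshold get labels too, which is why filtering with `df[s]` before grouping leaves gaps in the group labels.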

To rename the columns and reset the index of the output:

(df[s].groupby(grouper)['Datetime'].agg(['min', 'max'])
 .set_axis(['start_time', 'end_time'], axis=1)
 .reset_index(drop=True))

result:

    start_time          end_time
0   2022-11-30 00:04:00 2022-11-30 00:13:00
1   2022-11-30 00:20:00 2022-11-30 00:25:00
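
Putting the pieces together with the duration column from the question, one possible end-to-end sketch looks like this. The function name `threshold_runs` and its parameters are made up for illustration, and the sample frame is rebuilt with `pd.date_range` on the assumption that the readings are one minute apart, as in the question's table.

```python
import pandas as pd


def threshold_runs(df, threshold=150, time_col="Datetime", value_col="Value"):
    """Return one row per run of consecutive rows at or above `threshold`."""
    above = df[value_col].ge(threshold)
    # Label runs of equal mask values; the label changes whenever the mask flips.
    run_id = above.ne(above.shift()).cumsum()
    runs = (
        df[above]
        .groupby(run_id[above])[time_col]
        .agg(Start_Time="min", End_Time="max")  # named aggregation
        .reset_index(drop=True)
    )
    runs["Duration"] = (runs["End_Time"] - runs["Start_Time"]).dt.total_seconds()
    return runs


# Sample data from the question, assumed to be sampled once per minute.
values = [100, 110, 105, 105, 155, 160, 160, 160, 160, 165, 165, 160, 160, 150,
          120, 110, 115, 115, 130, 145, 150, 155, 155, 155, 155, 155, 140, 130, 120]
df = pd.DataFrame({
    "Datetime": pd.date_range("2022-11-30 00:00", periods=len(values), freq="1min"),
    "Value": values,
})

result = threshold_runs(df)
print(result)
```

For the sample data this gives two rows, 00:04 to 00:13 (540 seconds) and 00:20 to 00:25 (300 seconds), which matches the output requested in the question.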