filter dataframe where a value stays under a threshold for specified amount of time


I have a time series dataset which is described as follows:

timestamp,y
2019-08-01 00:00:00,772.0
2019-08-01 00:15:00,648.0
2019-08-01 00:30:00,652.0
2019-08-01 00:45:00,572.0
2019-08-01 01:00:00,604.0
2019-08-01 01:15:00,644.0
2019-08-01 01:30:00,544.0
...

What I am doing at the moment is filtering the dataframe where the values are under a specified threshold using something like:

filtered = df[lambda x: x["y"] < some_threshold]

However, what I would like is to get the rows where the threshold condition is met and remains met for at least an hour, i.e. 4 consecutive data points. So in my example, if the threshold is set to 650, it should return the last 4 rows but not the second row (648.0), since that dip lasts only a single 15-minute step.
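For reference, the sample above can be loaded reproducibly as follows (the `io.StringIO` inlining is just for the example); it also shows why the plain filter is not enough, since it keeps the lone 648.0 row:

```python
import io
import pandas as pd

csv = """timestamp,y
2019-08-01 00:00:00,772.0
2019-08-01 00:15:00,648.0
2019-08-01 00:30:00,652.0
2019-08-01 00:45:00,572.0
2019-08-01 01:00:00,604.0
2019-08-01 01:15:00,644.0
2019-08-01 01:30:00,544.0
"""

df = pd.read_csv(io.StringIO(csv), parse_dates=["timestamp"])
some_threshold = 650

# The plain filter keeps every sub-threshold row, including the
# isolated 648.0 at index 1 that should be excluded.
filtered = df[lambda x: x["y"] < some_threshold]
```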

CodePudding user response:

Let's first make your data worth looking at, by appending a few anchor points and interpolating back onto a 15-minute grid:

import numpy as np
import pandas as pd

# df is the question's dataframe; add three anchor points
df.loc[7] = ['2019-08-01 5:30:00', 900]
df.loc[8] = ['2019-08-01 8:30:00', 500]
df.loc[9] = ['2019-08-01 12:30:00', 900]
df.timestamp = pd.to_datetime(df.timestamp)
df = df.set_index('timestamp')
df = df.resample('15T').interpolate()  # back to a uniform 15-minute frequency
print(df)

# Output:
                          y
timestamp
2019-08-01 00:00:00  772.00
2019-08-01 00:15:00  648.00
2019-08-01 00:30:00  652.00
2019-08-01 00:45:00  572.00
2019-08-01 01:00:00  604.00
2019-08-01 01:15:00  644.00
2019-08-01 01:30:00  544.00
2019-08-01 01:45:00  566.25
2019-08-01 02:00:00  588.50
2019-08-01 02:15:00  610.75
2019-08-01 02:30:00  633.00
2019-08-01 02:45:00  655.25
2019-08-01 03:00:00  677.50
2019-08-01 03:15:00  699.75
2019-08-01 03:30:00  722.00
2019-08-01 03:45:00  744.25
2019-08-01 04:00:00  766.50
2019-08-01 04:15:00  788.75
2019-08-01 04:30:00  811.00
2019-08-01 04:45:00  833.25
2019-08-01 05:00:00  855.50
2019-08-01 05:15:00  877.75
2019-08-01 05:30:00  900.00
2019-08-01 05:45:00  866.67
2019-08-01 06:00:00  833.33
2019-08-01 06:15:00  800.00
2019-08-01 06:30:00  766.67
2019-08-01 06:45:00  733.33
2019-08-01 07:00:00  700.00
2019-08-01 07:15:00  666.67
2019-08-01 07:30:00  633.33
2019-08-01 07:45:00  600.00
2019-08-01 08:00:00  566.67
2019-08-01 08:15:00  533.33
2019-08-01 08:30:00  500.00
2019-08-01 08:45:00  525.00
2019-08-01 09:00:00  550.00
2019-08-01 09:15:00  575.00
2019-08-01 09:30:00  600.00
2019-08-01 09:45:00  625.00
2019-08-01 10:00:00  650.00
2019-08-01 10:15:00  675.00
2019-08-01 10:30:00  700.00
2019-08-01 10:45:00  725.00
2019-08-01 11:00:00  750.00
2019-08-01 11:15:00  775.00
2019-08-01 11:30:00  800.00
2019-08-01 11:45:00  825.00
2019-08-01 12:00:00  850.00
2019-08-01 12:15:00  875.00
2019-08-01 12:30:00  900.00

Now we can apply some rolling logic: count the values in each trailing one-hour window that are less than 650. Wherever that count reaches 4, the row itself and the 3 rows leading into it are what we want:

out = df[df.lt(650)               # True where y < 650
           .rolling('h')          # trailing one-hour window (4 rows at 15-minute spacing)
           .sum()
           .eq(4)                 # the entire hour was below the threshold
           .replace(False, np.nan)
           .bfill(limit=3)        # also flag the 3 rows leading into each hit
           .fillna(False)
           .values]
print(out)

# Output:
                              y
timestamp
2019-08-01 00:45:00  572.000000
2019-08-01 01:00:00  604.000000
2019-08-01 01:15:00  644.000000
2019-08-01 01:30:00  544.000000
2019-08-01 01:45:00  566.250000
2019-08-01 02:00:00  588.500000
2019-08-01 02:15:00  610.750000
2019-08-01 02:30:00  633.000000
2019-08-01 07:30:00  633.333333
2019-08-01 07:45:00  600.000000
2019-08-01 08:00:00  566.666667
2019-08-01 08:15:00  533.333333
2019-08-01 08:30:00  500.000000
2019-08-01 08:45:00  525.000000
2019-08-01 09:00:00  550.000000
2019-08-01 09:15:00  575.000000
2019-08-01 09:30:00  600.000000
2019-08-01 09:45:00  625.000000
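If the data is known to be uniformly sampled with no gaps, a count-based alternative (my own sketch, not part of the answer above) is to label consecutive below-threshold runs and keep only runs of at least 4 rows:

```python
import pandas as pd

def keep_long_runs(s: pd.Series, threshold: float, min_len: int = 4) -> pd.Series:
    """Boolean mask: True where s < threshold for at least min_len consecutive rows."""
    below = s < threshold
    # Label runs of consecutive equal values; cumsum increments at every change.
    run_id = below.ne(below.shift()).cumsum()
    run_len = below.groupby(run_id).transform("size")
    return below & (run_len >= min_len)

s = pd.Series([772.0, 648.0, 652.0, 572.0, 604.0, 644.0, 544.0])
mask = keep_long_runs(s, 650)
# Only the final run of 4 consecutive sub-650 values survives; the lone 648.0 is dropped.
```

Unlike the time-based rolling window, this counts rows rather than elapsed time, so it only matches "one hour" when the sampling really is every 15 minutes.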

CodePudding user response:

I'd suggest:

import numpy as np

time_delta = 1 # [unit: hours]

# convert timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# forward: check where the condition is met
boolList = []
for i, row in df.iterrows():
    boolList.append(df[(df['timestamp'] <= row['timestamp']) & (abs(df['timestamp'] - row['timestamp']) / np.timedelta64(1, 'h') < time_delta)]['y'].agg(max) < some_threshold)

# backward: set corresponding previous rows to True 
df['bool'] = boolList
for i, row in df.iterrows():
    if row['bool']:
        df.loc[(df['timestamp'] < row['timestamp']) & (abs(df['timestamp'] - row['timestamp']) / np.timedelta64(1, 'h') < time_delta), 'bool'] = True

# filter
filtered = df[df['bool']][['timestamp', 'y']]
filtered

yielding

            timestamp      y
3 2019-08-01 00:45:00  572.0
4 2019-08-01 01:00:00  604.0
5 2019-08-01 01:15:00  644.0
6 2019-08-01 01:30:00  544.0

Explanation

For the sake of readability, the solution is not provided via list comprehensions.

It leverages pd.to_datetime() and np.timedelta64() to select, for each row, the rows that fall within the preceding hour. The observations y in that window are then assessed by checking whether their maximum is below some_threshold; if every observation in the window is below the threshold, their maximum is too, which justifies agg(max). If so, the respective entry in boolList is set to True, indicating that all entries of the previous hour satisfied the condition. A second loop is then required to mark the earlier rows that fell within one of those qualifying intervals, so that df['bool'] flags all valid rows. Finally, df is filtered on this bool column, which is dropped from filtered.

Generality

The dataframe you provided suggests observations that are sampled uniformly across time (every 15 minutes). My proposed solution, however, does not assume that this holds for your entire dataframe. You can change the time interval via time_delta. Note that NaN entries in the timestamp or y column require additional handling.
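For reuse, the two-pass logic above can be wrapped in a function. This is a sketch under the same assumptions (the function name and structure are my own; unlike the in-place loop it takes a snapshot of the forward-pass flags, so backward marking does not cascade):

```python
import numpy as np
import pandas as pd

def filter_sustained(df, some_threshold, time_delta=1):
    """Keep rows where y stayed below some_threshold for a full time_delta hours."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    ts = out["timestamp"]

    def window_before(t):
        # Rows at or before t that are strictly less than time_delta hours old.
        return (ts <= t) & ((t - ts) / np.timedelta64(1, "h") < time_delta)

    # Forward pass: a row qualifies if everything in its trailing window is below threshold.
    ok = pd.Series(
        [out.loc[window_before(t), "y"].max() < some_threshold for t in ts],
        index=out.index,
    )
    # Backward pass: rows inside a qualifying window are kept as well.
    for t in ts[ok]:
        ok = ok | ((ts < t) & ((t - ts) / np.timedelta64(1, "h") < time_delta))
    return out[ok][["timestamp", "y"]]

df = pd.DataFrame({
    "timestamp": ["2019-08-01 00:00:00", "2019-08-01 00:15:00", "2019-08-01 00:30:00",
                  "2019-08-01 00:45:00", "2019-08-01 01:00:00", "2019-08-01 01:15:00",
                  "2019-08-01 01:30:00"],
    "y": [772.0, 648.0, 652.0, 572.0, 604.0, 644.0, 544.0],
})
print(filter_sustained(df, some_threshold=650))
```

On the question's sample this returns the last four rows (572.0, 604.0, 644.0, 544.0), matching the output above.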
