Considering the following dataframe:
Temperature Datetime
1 24.72 2021-01-01 10:00:00
2 25.76 2021-01-01 11:00:00
3 40 2021-01-01 12:00:00
4 25.31 2021-01-01 13:00:00
5 26.21 2021-01-01 14:00:00
6 26.59 2021-01-01 15:00:00
7 26.64 2021-01-01 20:00:00
8 26.38 2021-01-01 21:00:00
9 45 2021-01-01 22:00:00
10 26.23 2021-01-01 23:00:00
... ... ...
What we want to achieve is to remove outliers, as for example, in id 3 the temperature is 40 and it is clearly an outlier. We want to remove the whole row of id 3. We have already read this thread: Outlier detection based on the moving mean in Python.
In the thread it is described, that the outliers can be removed with the following code:
# Import Libraries
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
'Date':[2.3,4.6,7.0,9.3,15.6,17.9]
})
# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()
# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']
# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)
# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'rolling_temp', 'diff', 'drop_flag'],axis=1)
But we want to extend it even further, so that the median restarts whenever there is a missing value. So when considering the dataframe we have illustrated, we see an example where values are missing:
Temperature Datetime
1 24.72 2021-01-01 10:00:00
2 25.76 2021-01-01 11:00:00
3 40 2021-01-01 12:00:00
4 25.31 2021-01-01 13:00:00
5 26.21 2021-01-01 14:00:00
6 26.59 2021-01-01 15:00:00
7 26.64 2021-01-01 20:00:00 <-- Reset due to missing data between this point and the one before
8 26.38 2021-01-01 21:00:00
9 45 2021-01-01 22:00:00
10 26.23 2021-01-01 23:00:00
... ... ...
And what we want is that the code that removes the outliers also considers datetime, so that in id 7, we do notice that the datetime is 5 hours after id 6, and we can therefore conclude that data is missing, and we therefore want to reset the median, as we dont want a rolling median/mean that uses data not relevant in the outlier detection. We might have examples where data is missing for hours or maybe even days, and if the rolling median doesn't consider this, it will provide a bad data cleaning. An ideal threshold for this would be 1 hour, so if row two is not exactly one hour after row 1, then reset the median. Is this possible?
CodePudding user response:
In my opinion, you should use the datetime capabilities to compute your moving average. Something like computing the average temperature of the n hours around the given time, then compare your current temperature using a threshold.
Something like:
df['Datetime'] = pd.to_datetime(df['Datetime'])
s = (df
.rolling('5h', center=True, on='Datetime')
['Temperature'].mean()
)
# 10° diff, absolute threshold
df['outlier'] = df['Temperature'].sub(s).abs().gt(10)
df.loc[mask, 'outlier'] = True
# to drop the rows:
# df = df.loc[~mask]
output:
Temperature Datetime outlier
1 24.72 2021-01-01 10:00:00 False
2 25.76 2021-01-01 11:00:00 False
3 40.00 2021-01-01 12:00:00 True
4 25.31 2021-01-01 13:00:00 False
5 26.21 2021-01-01 14:00:00 False
6 26.59 2021-01-01 15:00:00 False
7 26.64 2021-01-01 20:00:00 False
8 26.38 2021-01-01 21:00:00 False
9 45.00 2021-01-01 22:00:00 True
10 26.23 2021-01-01 23:00:00 False