I am trying to clean spikes in time-series data in a Pandas DataFrame.
value = 5000
for index, row in gauteng_df.iterrows():
    if index == gauteng_df.shape[0]-1:
        break
    upper, lower = row['Admissions to Date'] + value, row['Admissions to Date'] - value
    a = gauteng_df.iloc[index + 1]['Admissions to Date']
    if a > upper or a < lower:
        a = (gauteng_df.iloc[index-1]['Admissions to Date'] + gauteng_df.iloc[index + 1]['Admissions to Date'])/2
        gauteng_df.iloc[index]['Admissions to Date'] = a
I tried to reference the subsequent data point. If the current data point falls outside of the interval around the subsequent data point (i.e. point ± value), the current data point will be replaced by the average of the previous data point and the next data point. Unfortunately, when I tried to plot the new graph, no changes are reflected and the spikes are still there.
I would appreciate any help with this! Also, df.iterrows() might not be the most efficient method, so I would also appreciate suggestions for a better way to replace the spike values.
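For reference, one likely reason the plot shows no changes is the final chained assignment: `gauteng_df.iloc[index]['Admissions to Date'] = a` writes to a temporary copy of the row, not to the frame itself. A minimal sketch of the same loop with the write-back done through a single `.loc` call (the sample series below is hypothetical):

```python
import pandas as pd

# Hypothetical sample series standing in for the real data
gauteng_df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0]})
value = 5000

for index, row in gauteng_df.iterrows():
    if index == gauteng_df.shape[0] - 1:
        break
    upper = row['Admissions to Date'] + value
    lower = row['Admissions to Date'] - value
    a = gauteng_df.loc[index + 1, 'Admissions to Date']
    if a > upper or a < lower:
        # Single .loc call: row and column selected together,
        # so the assignment hits the frame, not a temporary copy
        gauteng_df.loc[index, 'Admissions to Date'] = (
            gauteng_df.loc[index - 1, 'Admissions to Date']
            + gauteng_df.loc[index + 1, 'Admissions to Date']
        ) / 2
```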
CodePudding user response:
Here is an alternative approach that might save you the trouble of iterating over DataFrame values: scipy.signal.find_peaks.
import pandas as pd
import numpy as np
from scipy.signal import find_peaks
# Example data with a peak and a valley
gauteng_df = pd.DataFrame({'Admissions to Date':
[8000, 4500, 12000, 5500,
3000, 7500, 1000, 8500]
})
# Peak detection threshold
value = 5000
# `prominence` sets minimum height above surrounding
# signal at which a given value is considered a peak
peak_idx = find_peaks(gauteng_df['Admissions to Date'], prominence=value)[0]
# To detect valleys deeper than `value`,
# run find_peaks on negative of data
valley_idx = find_peaks(-gauteng_df['Admissions to Date'], prominence=value)[0]
# Combine indexes of peaks and valleys into a single array
idx = np.concatenate((peak_idx, valley_idx))
# Build an indicator column of peaks and valleys, or outliers
gauteng_df['outlier'] = False
gauteng_df.loc[idx, 'outlier'] = True
# Replace each outlier value with NaN
gauteng_df.loc[gauteng_df['outlier'], 'Admissions to Date'] = np.nan
# Interpolate over NaNs just created with default linear method
gauteng_df['Interpolated'] = (gauteng_df['Admissions to Date']
.interpolate()
.astype(int))
# Result
print(gauteng_df)
   Admissions to Date  outlier  Interpolated
0              8000.0    False          8000
1              4500.0    False          4500
2                 NaN     True          5000
3              5500.0    False          5500
4              3000.0    False          3000
5              7500.0    False          7500
6                 NaN     True          8000
7              8500.0    False          8500
CodePudding user response:
Assuming your dataframe is sorted by time, create a new column with the previous row value...
df['Previous_admissions_value'] = df['Admissions to Date'].shift(1, fill_value=0)
...and another new column with the next row value:
df['Next_admissions_value'] = df['Admissions to Date'].shift(-1, fill_value=0)
Since the first and last rows have no previous and next row values respectively, the code above fills them with 0. If needed, you can manually update them to some other value.
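For example, if a 0 at the edges would itself look like a spike, one option (a sketch; the frame and values are hypothetical) is to copy the boundary values into those edge cells instead of 0:

```python
import pandas as pd

# Hypothetical data standing in for the real frame
df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0]})

df['Previous_admissions_value'] = df['Admissions to Date'].shift(1)
df['Next_admissions_value'] = df['Admissions to Date'].shift(-1)

# Fill the edge cells with the row's own value instead of 0,
# so the last row's "next" value never looks like a jump
df.loc[df.index[0], 'Previous_admissions_value'] = df['Admissions to Date'].iloc[0]
df.loc[df.index[-1], 'Next_admissions_value'] = df['Admissions to Date'].iloc[-1]
```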
Then check for condition and make updates:
import numpy as np
df['update_condition'] = np.where(abs(df['Admissions to Date'] - df['Next_admissions_value']) > value, 1, 0)
df['Admissions to Date'] = np.where(df['update_condition'] > 0,
                                    (df['Next_admissions_value'] + df['Previous_admissions_value']) / 2.0,
                                    df['Admissions to Date'])
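For reference, the pieces above combine into a single runnable sketch (the data values are made up for illustration; `value` is the same threshold as in the question):

```python
import pandas as pd
import numpy as np

value = 5000  # threshold from the question
df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0,
                                          3000.0, 7500.0, 1000.0, 8500.0]})

# Previous/next row values, edges filled with 0
df['Previous_admissions_value'] = df['Admissions to Date'].shift(1, fill_value=0)
df['Next_admissions_value'] = df['Admissions to Date'].shift(-1, fill_value=0)

# Flag rows whose next value differs by more than the threshold
df['update_condition'] = np.where(
    abs(df['Admissions to Date'] - df['Next_admissions_value']) > value, 1, 0)

# Replace flagged values with the average of their neighbours
df['Admissions to Date'] = np.where(
    df['update_condition'] > 0,
    (df['Next_admissions_value'] + df['Previous_admissions_value']) / 2.0,
    df['Admissions to Date'])
```

Note that with this sample data the last row also gets flagged, because its "next" value was filled with 0; that is the edge case described above.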
CodePudding user response:
Use Series.diff and Series.shift:
Spikes occur where diff() or diff(-1) (the differences to the neighboring values) are greater than value:
value = 5000
spikes = (gauteng_df['Admissions to Date'].diff().abs().gt(value)
          | gauteng_df['Admissions to Date'].diff(-1).abs().gt(value))
At the spike locations, add shift() and shift(-1) (the neighboring values) and divide by 2:
gauteng_df.loc[spikes, 'Admissions to Date'] = (gauteng_df['Admissions to Date'].shift()
                                                .add(gauteng_df['Admissions to Date'].shift(-1))
                                                .div(2))
This is a fully vectorized approach, as opposed to iterrows(), which (as you suspected) should be avoided when possible.
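Putting the two steps together on a hypothetical sample series (made-up values, same threshold as in the question):

```python
import pandas as pd

value = 5000
gauteng_df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0,
                                                  3000.0, 7500.0, 1000.0, 8500.0]})
s = gauteng_df['Admissions to Date']

# A point is a spike if it differs from either neighbour by more than `value`
spikes = s.diff().abs().gt(value) | s.diff(-1).abs().gt(value)

# Replace spikes with the mean of the two neighbouring values
# (the right-hand side is computed from the original column first)
gauteng_df.loc[spikes, 'Admissions to Date'] = s.shift().add(s.shift(-1)).div(2)
```

One caveat: a large spike also inflates the one-sided difference of its immediate neighbours, so those can be flagged as well, and a flagged first or last point becomes NaN because it only has one neighbour.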