I am trying to clean spikes in time-series data in a Pandas DataFrame.
value = 5000
for index, row in gauteng_df.iterrows():
    if index == gauteng_df.shape[0]-1:
        break
    upper, lower = row['Admissions to Date'] + value, row['Admissions to Date'] - value
    a = gauteng_df.iloc[index + 1]['Admissions to Date']
    if a > upper or a < lower:
        a = (gauteng_df.iloc[index-1]['Admissions to Date'] + gauteng_df.iloc[index + 1]['Admissions to Date'])/2
        gauteng_df.iloc[index]['Admissions to Date'] = a
I tried to reference the subsequent data point. If the current data point falls outside of the interval around the subsequent data point (i.e. point ± value), the current data point will be replaced by the average of the previous data point and the next data point. Unfortunately, when I tried to plot the new graph, no changes are reflected and the spikes are still there.
I would appreciate any help with this! Also, df.iterrows() might not be the most efficient method, so I would also appreciate suggestions for a better way to replace the spike values.
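For reference, one likely reason the plot shows no changes is the final chained assignment: `gauteng_df.iloc[index]['Admissions to Date'] = a` writes to a temporary copy of the row, not to the frame itself. A minimal sketch of the same loop with the write-back done through a single `.loc` call (the sample series below is hypothetical):

```python
import pandas as pd

# Hypothetical sample series standing in for the real data
gauteng_df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0]})
value = 5000

for index, row in gauteng_df.iterrows():
    if index == gauteng_df.shape[0] - 1:
        break
    upper = row['Admissions to Date'] + value
    lower = row['Admissions to Date'] - value
    a = gauteng_df.loc[index + 1, 'Admissions to Date']
    if a > upper or a < lower:
        # Single .loc call: row and column selected together,
        # so the assignment hits the frame, not a temporary copy
        gauteng_df.loc[index, 'Admissions to Date'] = (
            gauteng_df.loc[index - 1, 'Admissions to Date']
            + gauteng_df.loc[index + 1, 'Admissions to Date']
        ) / 2
```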
CodePudding user response:
Here is an alternative approach that might save you the trouble of iterating over DataFrame values: scipy.signal.find_peaks.
import pandas as pd
import numpy as np
from scipy.signal import find_peaks
# Example data with a peak and a valley
gauteng_df = pd.DataFrame({'Admissions to Date':
[8000, 4500, 12000, 5500,
3000, 7500, 1000, 8500]
})
# Peak detection threshold
value = 5000
# `prominence` sets minimum height above surrounding
# signal at which a given value is considered a peak
peak_idx = find_peaks(gauteng_df['Admissions to Date'], prominence=value)[0]
# To detect valleys deeper than `value`,
# run find_peaks on negative of data
valley_idx = find_peaks(-gauteng_df['Admissions to Date'], prominence=value)[0]
# Combine indexes of peaks and valleys into a single array
idx = np.concatenate((peak_idx, valley_idx))
# Build an indicator column of peaks and valleys, or outliers
gauteng_df['outlier'] = False
gauteng_df.loc[idx, 'outlier'] = True
# Replace each outlier value with NaN
gauteng_df.loc[gauteng_df['outlier'], 'Admissions to Date'] = np.nan
# Interpolate over NaNs just created with default linear method
gauteng_df['Interpolated'] = (gauteng_df['Admissions to Date']
.interpolate()
.astype(int))
# Result
print(gauteng_df)
   Admissions to Date  outlier  Interpolated
0              8000.0    False          8000
1              4500.0    False          4500
2                 NaN     True          5000
3              5500.0    False          5500
4              3000.0    False          3000
5              7500.0    False          7500
6                 NaN     True          8000
7              8500.0    False          8500
CodePudding user response:
Assuming your dataframe is sorted by time, create a new column with the previous row value...
df['Previous_admissions_value'] = df['Admissions to Date'].shift(1, fill_value=0)
...and another new column with the next row value:
df['Next_admissions_value'] = df['Admissions to Date'].shift(-1, fill_value=0)
Since the first and last rows have no previous and next row values respectively, the code above fills them with 0. If needed, you can manually update them to some other value.
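For example, if a 0 at the edges would itself look like a spike, one option (a sketch; the frame and values are hypothetical) is to copy the boundary values into those edge cells instead of 0:

```python
import pandas as pd

# Hypothetical data standing in for the real frame
df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0]})

df['Previous_admissions_value'] = df['Admissions to Date'].shift(1)
df['Next_admissions_value'] = df['Admissions to Date'].shift(-1)

# Fill the edge cells with the row's own value instead of 0,
# so the last row's "next" value never looks like a jump
df.loc[df.index[0], 'Previous_admissions_value'] = df['Admissions to Date'].iloc[0]
df.loc[df.index[-1], 'Next_admissions_value'] = df['Admissions to Date'].iloc[-1]
```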
Then check for condition and make updates:
import numpy as np
df['update_condition'] = np.where(abs(df['Admissions to Date'] - df['Next_admissions_value']) > value, 1, 0)
df['Admissions to Date'] = np.where(df['update_condition'] > 0,
                                    (df['Next_admissions_value'] + df['Previous_admissions_value']) / 2.0,
                                    df['Admissions to Date'])
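For reference, the pieces above combine into a single runnable sketch (the data values are made up for illustration; `value` is the same threshold as in the question):

```python
import pandas as pd
import numpy as np

value = 5000  # threshold from the question
df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0,
                                          3000.0, 7500.0, 1000.0, 8500.0]})

# Previous/next row values, edges filled with 0
df['Previous_admissions_value'] = df['Admissions to Date'].shift(1, fill_value=0)
df['Next_admissions_value'] = df['Admissions to Date'].shift(-1, fill_value=0)

# Flag rows whose next value differs by more than the threshold
df['update_condition'] = np.where(
    abs(df['Admissions to Date'] - df['Next_admissions_value']) > value, 1, 0)

# Replace flagged values with the average of their neighbours
df['Admissions to Date'] = np.where(
    df['update_condition'] > 0,
    (df['Next_admissions_value'] + df['Previous_admissions_value']) / 2.0,
    df['Admissions to Date'])
```

Note that with this sample data the last row also gets flagged, because its "next" value was filled with 0; that is the edge case described above.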
CodePudding user response:
Use Series.diff and Series.shift:
Spikes occur where diff() or diff(-1) (the differences to the neighboring values) are greater than value:
value = 5000
spikes = (gauteng_df['Admissions to Date'].diff().abs().gt(value)
          | gauteng_df['Admissions to Date'].diff(-1).abs().gt(value))
At the spike locations, add shift() and shift(-1) (the neighboring values) and divide by 2:
gauteng_df.loc[spikes, 'Admissions to Date'] = (gauteng_df['Admissions to Date'].shift()
                                                .add(gauteng_df['Admissions to Date'].shift(-1))
                                                .div(2))
This is a fully vectorized approach, as opposed to iterrows(), which (as you suspected) should be avoided when possible.
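Putting the two steps together on a hypothetical sample series (made-up values, same threshold as in the question):

```python
import pandas as pd

value = 5000
gauteng_df = pd.DataFrame({'Admissions to Date': [8000.0, 4500.0, 12000.0, 5500.0,
                                                  3000.0, 7500.0, 1000.0, 8500.0]})
s = gauteng_df['Admissions to Date']

# A point is a spike if it differs from either neighbour by more than `value`
spikes = s.diff().abs().gt(value) | s.diff(-1).abs().gt(value)

# Replace spikes with the mean of the two neighbouring values
# (the right-hand side is computed from the original column first)
gauteng_df.loc[spikes, 'Admissions to Date'] = s.shift().add(s.shift(-1)).div(2)
```

One caveat: a large spike also inflates the one-sided difference of its immediate neighbours, so those can be flagged as well, and a flagged first or last point becomes NaN because it only has one neighbour.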