Clean a Series or DataFrame of values, according to condition

Time:08-24

Good Morning, I have a Series like the following.

Time Temperature
2019-01-02 02:00:00 14.95
2019-01-02 03:00:00 15.0
2019-01-02 04:00:00 37.0
2019-01-02 05:00:00 15.0
2019-01-02 06:00:00 15.5

I would like to replace every value that does not follow the trend (e.g. the value 37) with NaN. I was thinking of adding a condition that compares each value with the one in the previous row, but I don't know whether there is a faster way.

CodePudding user response:

You could use scipy.signal.find_peaks to find the values that do not follow the trend (i.e. the peaks). find_peaks takes several parameters (such as height, threshold, prominence and width) that define what counts as a peak.

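import numpy as np
from scipy.signal import find_peaks

# Work on a float copy so that assigning NaN is valid
temp = df.Temperature.to_numpy(dtype=float, copy=True)
# threshold=5: a peak must exceed both of its neighbours by at least 5
idx, _ = find_peaks(temp, threshold=5)
temp[idx] = np.nan

df.Temperature = temp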
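
Note that find_peaks only reports local maxima, so a sudden dip would be missed. A minimal sketch that also catches dips by running find_peaks on the negated signal is shown here; the sample frame is rebuilt from the question, and the threshold of 5 is an assumed value, not something the question specifies.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Sample data copied from the question
times = pd.to_datetime([
    "2019-01-02 02:00:00", "2019-01-02 03:00:00", "2019-01-02 04:00:00",
    "2019-01-02 05:00:00", "2019-01-02 06:00:00",
])
df = pd.DataFrame({"Temperature": [14.95, 15.0, 37.0, 15.0, 15.5]}, index=times)

temp = df["Temperature"].to_numpy(dtype=float, copy=True)

# Upward spikes: samples exceeding both neighbours by at least 5 (assumed threshold)
spikes, _ = find_peaks(temp, threshold=5)
# Downward dips: the same test applied to the negated signal
dips, _ = find_peaks(-temp, threshold=5)

outliers = np.union1d(spikes, dips)
temp[outliers] = np.nan
df["Temperature"] = temp
print(df)
```

The first and last samples can never be flagged this way, because find_peaks needs a neighbour on both sides.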

CodePudding user response:

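You can simply compare each value with the next one. Note that this masks every value that is larger than its successor, so it only suits data that otherwise never decreases: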

df.loc[df.Temperature - df.Temperature.shift(-1) > 0, 'Temperature'] = np.nan

df:

Time    Temperature
2019-01-02 02:00:00 14.95
2019-01-02 03:00:00 15.00
2019-01-02 04:00:00 NaN
2019-01-02 05:00:00 15.00
2019-01-02 06:00:00 15.50
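
If ordinary small decreases should be kept, one option is to flag a value only when it jumps away from both its previous and its next neighbour by more than some tolerance. This is a rough sketch of that idea rather than the one-liner above; the name `threshold` and its value of 5.0 are assumptions chosen for the sample data.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
times = pd.to_datetime([
    "2019-01-02 02:00:00", "2019-01-02 03:00:00", "2019-01-02 04:00:00",
    "2019-01-02 05:00:00", "2019-01-02 06:00:00",
])
df = pd.DataFrame({"Temperature": [14.95, 15.0, 37.0, 15.0, 15.5]}, index=times)

threshold = 5.0  # assumed tolerance, tune for your data
t = df["Temperature"]

# Absolute jump relative to the previous and to the next row
jump_prev = (t - t.shift(1)).abs() > threshold
jump_next = (t - t.shift(-1)).abs() > threshold

# Flag a row only if it differs strongly from BOTH neighbours;
# the first and last rows compare against NaN and are never flagged
df.loc[jump_prev & jump_next, "Temperature"] = np.nan
print(df)
```

Requiring both jumps keeps the row after the spike (15.0 at 05:00), which differs from its predecessor only because the predecessor is the outlier.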

CodePudding user response:

You might have to define more tightly what you mean by "follow the trend", but I'll give an example for, say, a point that is more than 1.5 times the mean of the points within a 5-timeslot window.

You could use pandas Series.rolling() to get a local rolling mean and then use boolean indexing on the Series to apply the condition.

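import numpy as np
import pandas as pd

# Make some random data (values between 14 and 16) with an outlier
data_points = 48
random_data = np.random.random(data_points)
temps = random_data * 2 + 14
temps[6] = 37.0
# Hourly timestamps; pandas 2.2+ prefers the lowercase alias freq="h"
times = pd.date_range(start="2019-01-02 02:00:00", freq="H", periods=data_points)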

s = pd.Series(data=temps, index=times)

# See data with the outlier
print(s)

# Use pandas Series.rolling() to find the local (centred, 5-point) rolling mean
rolling_mean = s.rolling(5, min_periods=1, center=True).mean()

# Use boolean indexing to replace only values > 1.5 times the rolling mean
s[s > rolling_mean * 1.5] = float("NaN")

# Outlier replaced with NaN
print(s)
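
The rolling mean above includes the outlier itself, and the "1.5 times" rule only catches upward spikes. A possible variation (not the method used above) is to compare each point with a rolling median and flag large deviations in either direction. In this sketch the window of 5 matches the example above, while `max_deviation = 5.0` is an assumed tolerance, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
times = pd.to_datetime([
    "2019-01-02 02:00:00", "2019-01-02 03:00:00", "2019-01-02 04:00:00",
    "2019-01-02 05:00:00", "2019-01-02 06:00:00",
])
s = pd.Series([14.95, 15.0, 37.0, 15.0, 15.5], index=times, name="Temperature")

window = 5           # same window size as the rolling-mean example
max_deviation = 5.0  # assumed tolerance, tune for your data

# The median is much less affected by a single extreme value than the mean
rolling_median = s.rolling(window, min_periods=1, center=True).median()

# Flag points that sit far above OR below the local median
is_outlier = (s - rolling_median).abs() > max_deviation

# Series.mask replaces the flagged entries with NaN and returns a new Series
cleaned = s.mask(is_outlier)
print(cleaned)
```

Because `mask` returns a new Series, the original `s` stays untouched, which makes it easy to compare the data before and after cleaning.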