I have a large amount of time-series sensor data in a pandas DataFrame. The resolution is one observation every 15 minutes, over 1 month, for 876 sensors.
The data has some daily seasonality and some faulty measurements in single sensors on about 50% of the observations.
I want to remove the seasonality. My first attempt was to difference each value against the same time on the previous day:
df.diff(periods=96)
This does not work, because then I have an outlier on 2 days (the day with the actual faulty measurement and the day after).
Therefore I wrote this snippet of code which does what it should and works fine:
for index in df.index:
    for column in df.columns:
        df[column][index] = df[column][index] - (
            df[column][df.index % 96 == index % 96]).mean()
The problem is that this is incredibly slow. Is there a way to achieve the same thing with a pandas function significantly faster?
CodePudding user response:
Iterating over a DataFrame or Series should be your last resort, because it is very slow.
In this case, you can use groupby together with transform to compute the mean of each season for all the columns, and then subtract it from your DataFrame in a vectorized way.
Based on your code, it seems that this should work:
period = 96
season_mean = df.groupby(df.index % period).transform('mean')
df -= season_mean
Or, if you prefer to do it in a single step:
period = 96
df = df.groupby(df.index % period).transform(lambda g: g - g.mean())
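To check the idea without your real data, here is a minimal sketch on synthetic data. The sine-shaped daily pattern, the noise level, the 5 stand-in sensors, and the names `deseasonalized` and `expected` are all assumptions made for illustration. It also assumes, as your code does, an integer index in which every 96th row falls on the same time of day.

```python
import numpy as np
import pandas as pd

period = 96      # 15-minute observations -> 96 per day
n_days = 30      # roughly one month (assumed)
n_sensors = 5    # small stand-in for the 876 sensors (assumed)

# Synthetic data: a daily sine pattern plus random noise (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(period * n_days)
daily = np.sin(2 * np.pi * (t % period) / period)
data = daily[:, None] + 0.1 * rng.standard_normal((t.size, n_sensors))
df = pd.DataFrame(data, columns=[f"sensor_{i}" for i in range(n_sensors)])

# Mean of each time-of-day slot per sensor, broadcast back to df's shape
season_mean = df.groupby(df.index % period).transform("mean")
deseasonalized = df - season_mean

# Cross-check against a direct NumPy computation of the same quantity
reshaped = data.reshape(n_days, period, n_sensors)
expected = (reshaped - reshaped.mean(axis=0)).reshape(-1, n_sensors)
assert np.allclose(deseasonalized.to_numpy(), expected)
```

Note that this subtracts means computed once from the original values. Your loop overwrites df as it goes, so its later means are computed on partly modified data and the results may differ slightly.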