Pandas: remove daily seasonality from data by subtracting the daily mean

Time:11-15

I have a large amount of time-series sensor data in a pandas DataFrame. The resolution is one observation every 15 minutes for 1 month, for 876 sensors.

The data has some daily seasonality and some faulty measurements in single sensors on about 50% of the observations.

I want to remove the seasonality. My first attempt was:

df.diff(periods=96)

This does not work, because each faulty measurement then produces an outlier on two days (the day with the actual faulty measurement and the day after).
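A minimal sketch of that effect, assuming a toy period of 4 instead of 96 and a made-up faulty value of +100 (both are illustrative, not taken from the real data):

```python
import numpy as np
import pandas as pd

period = 4  # toy stand-in for the 96 observations per day

# Three "days" of a perfectly periodic signal.
pattern = np.array([0.0, 1.0, 2.0, 1.0])
values = np.tile(pattern, 3)
values[5] += 100.0  # hypothetical faulty measurement on day 2

s = pd.Series(values)
diffed = s.diff(periods=period)

# Rows where the differenced series is far from zero.
affected = diffed[diffed.abs() > 50].index.tolist()
print(affected)
```

The single fault at row 5 shows up at row 5 and again one period later at row 9, with the opposite sign.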

Therefore I wrote this snippet of code which does what it should and works fine:

for index in df.index:
    for column in df.columns:
        df[column][index] = df[column][index] - (
            df[column][df.index % 96 == index % 96]).mean()

The problem is that this is incredibly slow. Is there a way to achieve the same thing with a pandas function significantly faster?

Answer:

Iterating over a DataFrame or Series should be your last resort; it is very slow.

In this case, you can use groupby with transform to compute the mean of each season (each position within the day) for all the columns, and then subtract it from your DataFrame in a vectorized way.

Based on your code, it seems that this should work:

period = 96
season_mean = df.groupby(df.index % period).transform('mean')
df -= season_mean
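Note that `df.index % period` relies on an integer index, which is what the question's code suggests. If the index were a DatetimeIndex instead (an assumption, not something stated in the question), one possible variant is to group by time of day. The frame `df_dt` and its column name below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 3 days at 15-minute resolution, one sensor,
# with a perfectly periodic daily signal.
idx = pd.date_range("2023-01-01", periods=3 * 96, freq="15min")
df_dt = pd.DataFrame({"sensor_a": np.tile(np.arange(96.0), 3)}, index=idx)

# Group rows by their time of day instead of by index modulo 96.
time_of_day = df_dt.index.time
deseasoned_dt = df_dt - df_dt.groupby(time_of_day).transform("mean")
```

Because the toy signal repeats exactly every day, the result here is zero everywhere.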

Or, if you prefer to do it in one step:

period = 96
df = df.groupby(df.index % period).transform(lambda g: g - g.mean()) 
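As a quick sanity check, here is a sketch on made-up data with a toy period of 4 and two hypothetical sensor columns. It compares the two variants and confirms that each slot averages to roughly zero afterwards:

```python
import numpy as np
import pandas as pd

period = 4  # toy stand-in for 96
rng = np.random.default_rng(0)

# Hypothetical data: 3 days, 2 sensors, default integer index (0, 1, 2, ...).
pattern = np.array([0.0, 1.0, 2.0, 1.0])
df = pd.DataFrame({
    "sensor_a": np.tile(pattern, 3) + rng.normal(0, 0.1, 12),
    "sensor_b": np.tile(pattern * 2, 3) + rng.normal(0, 0.1, 12),
})

slot = df.index % period  # position within the day for each row

# Variant 1: broadcast the per-slot mean back to every row, then subtract.
deseasoned_1 = df - df.groupby(slot).transform("mean")

# Variant 2: subtract the group mean inside the transform.
deseasoned_2 = df.groupby(slot).transform(lambda g: g - g.mean())

# Mean of each slot after removing the seasonality.
slot_means = deseasoned_1.groupby(slot).mean()
```

Both variants should give the same frame up to floating-point error. The first is usually the faster of the two, since `transform("mean")` uses a built-in aggregation rather than calling a Python function per group.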