Rolling mean before specific datetime

Time:12-23

I have a dataframe where df.head(4) looks like this

order time           delivered time       duration
2022-11-05 13:03:31  2022-11-05 13:33:31  30
2022-11-05 13:20:31  2022-11-05 13:54:31  34
2022-11-05 13:34:31  2022-11-05 13:59:31  25
2022-11-05 13:58:31  2022-11-05 14:30:31  32

I want to calculate the rolling mean of the past 2 orders for each new order. However, the rolling mean should not consider orders that have not been delivered yet.

The rolling mean for the last row should be (30 + 34) / 2 = 32. It should not include 25, because that order had not been delivered when the last order was placed.

The resulting dataframe should look like this

order time           delivered time       duration  rolling mean
2022-11-05 13:03:31  2022-11-05 13:33:31  30        NA
2022-11-05 13:20:31  2022-11-05 13:54:31  34        NA
2022-11-05 13:34:31  2022-11-05 13:59:31  25        30
2022-11-05 13:58:31  2022-11-05 14:30:31  32        32

I tried the following code

def rolling(df):
    dt=df['order time'].values
    df['rolling mean'] = (df.set_axis(dt)
                        .rolling(2)['duration']
                        .mean()
                        .set_axis(df.index)
                        .shift(1)
                      )
    return df

df_result =  df.apply(rolling)

However, this averages the last 2 orders even if one of them has not been delivered yet.

CodePudding user response:

This reproduces the output you want for the sample data.

from io import StringIO
import pandas as pd  # 1.5.2


# Comma-separated sample data, so it parses the same way however whitespace is copied
string = StringIO("""order time,delivered time,duration
2022-11-05 13:03:31,2022-11-05 13:33:31,30
2022-11-05 13:20:31,2022-11-05 13:54:31,34
2022-11-05 13:34:31,2022-11-05 13:59:31,25
2022-11-05 13:58:31,2022-11-05 14:30:31,32""")

df = pd.read_csv(string)

df["order time"] = pd.to_datetime(df["order time"])
df = df.set_index("order time")

df["rolling mean"] = df.rolling(
    window=2,
    min_periods=1
)["duration"].mean().shift(periods=2)

print(df)
                          delivered time  duration  rolling mean
order time                                                      
2022-11-05 13:03:31  2022-11-05 13:33:31        30           NaN
2022-11-05 13:20:31  2022-11-05 13:54:31        34           NaN
2022-11-05 13:34:31  2022-11-05 13:59:31        25          30.0
2022-11-05 13:58:31  2022-11-05 14:30:31        32          32.0

Explanation

You have the window set correctly, but if you leave min_periods at its default, it is set to the same value as window for an integer window (see the pandas rolling documentation). Any window with fewer than 2 values then yields NaN.

By changing min_periods to 1, the mean is calculated from 2 values when the window is full. When only 1 value is available, the mean is just that value, e.g. for the first value in the series: 30/1 = 30.
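To make the effect of min_periods concrete, here is a minimal sketch on the same four durations. The variable names are only illustrative.

```python
import pandas as pd

# The four durations from the question
s = pd.Series([30, 34, 25, 32])

# Default min_periods (equal to window): the first window is incomplete, so it is NaN
default = s.rolling(window=2).mean()

# min_periods=1: a window holding a single value returns that value
relaxed = s.rolling(window=2, min_periods=1).mean()

print(default.tolist())  # [nan, 32.0, 29.5, 28.5]
print(relaxed.tolist())  # [30.0, 32.0, 29.5, 28.5]
```

The only difference is the first entry, which is why the shifted column later starts with 30.0 instead of NaN.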

With min_periods=1 and a shift of only 1, the result is the following.

df["rolling mean"] = df.rolling(
    window=2,
    min_periods=1
)["duration"].mean().shift(1)
                          delivered time  duration  rolling mean
order time                                                      
2022-11-05 13:03:31  2022-11-05 13:33:31        30           NaN
2022-11-05 13:20:31  2022-11-05 13:54:31        34          30.0
2022-11-05 13:34:31  2022-11-05 13:59:31        25          32.0
2022-11-05 13:58:31  2022-11-05 14:30:31        32          29.5

So we shift it by a period of 2 instead of 1 to get your desired output.

df["rolling mean"] = df.rolling(
    window=2,
    min_periods=1
)["duration"].mean().shift(periods=2)

                          delivered time  duration  rolling mean
order time                                                      
2022-11-05 13:03:31  2022-11-05 13:33:31        30           NaN
2022-11-05 13:20:31  2022-11-05 13:54:31        34           NaN
2022-11-05 13:34:31  2022-11-05 13:59:31        25          30.0
2022-11-05 13:58:31  2022-11-05 14:30:31        32          32.0
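Note that shift(periods=2) matches the sample but never looks at the delivered time column, so it may not hold for other data. Below is a rough sketch of an explicit alternative, not part of the answer above: for each order it averages the last n orders whose delivered time is earlier than that order's time. The helper name delivered_rolling_mean is hypothetical, "last" is assumed to mean most recent by order time, and the loop is quadratic, so it is only meant for small frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order time": pd.to_datetime([
        "2022-11-05 13:03:31", "2022-11-05 13:20:31",
        "2022-11-05 13:34:31", "2022-11-05 13:58:31",
    ]),
    "delivered time": pd.to_datetime([
        "2022-11-05 13:33:31", "2022-11-05 13:54:31",
        "2022-11-05 13:59:31", "2022-11-05 14:30:31",
    ]),
    "duration": [30, 34, 25, 32],
})


def delivered_rolling_mean(frame, n=2):
    """Hypothetical helper: mean duration of the last n already-delivered orders."""
    frame = frame.sort_values("order time")
    means = []
    for placed in frame["order time"]:
        # Keep only orders that were delivered before this order was placed
        done = frame.loc[frame["delivered time"] < placed, "duration"]
        # Average the n most recent of those; NaN when none were delivered yet
        means.append(done.tail(n).mean() if len(done) else np.nan)
    return pd.Series(means, index=frame.index)


df["rolling mean"] = delivered_rolling_mean(df)
print(df["rolling mean"].tolist())  # [nan, nan, 30.0, 32.0]
```

On the sample data this gives the same column as the shift-based version, but here the third row is 30 because only the first order had been delivered by 13:34:31, not because of the shift.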
