I have a dataframe where df.head(4) looks like this
order time | delivered time | duration |
---|---|---|
2022-11-05 13:03:31 | 2022-11-05 13:33:31 | 30 |
2022-11-05 13:20:31 | 2022-11-05 13:54:31 | 34 |
2022-11-05 13:34:31 | 2022-11-05 13:59:31 | 25 |
2022-11-05 13:58:31 | 2022-11-05 14:30:31 | 32 |
I want to calculate the rolling mean of the past 2 orders for each new order. However, the rolling mean should not consider orders which have not been delivered yet.
The rolling mean for the last row should be (30 + 34)/2 = 32; it should not include 25, because that order had not been delivered when the last order was placed.
The resulting dataframe should look like this
order time | delivered time | duration | rolling mean |
---|---|---|---|
2022-11-05 13:03:31 | 2022-11-05 13:33:31 | 30 | NA |
2022-11-05 13:20:31 | 2022-11-05 13:54:31 | 34 | NA |
2022-11-05 13:34:31 | 2022-11-05 13:59:31 | 25 | 30 |
2022-11-05 13:58:31 | 2022-11-05 14:30:31 | 32 | 32 |
I tried the following code
def rolling(df):
    dt = df['order time'].values
    df['rolling mean'] = (df.set_axis(dt)
                            .rolling(2)['duration']
                            .mean()
                            .set_axis(df.index)
                            .shift(1)
                         )
    return df

df_result = df.apply(rolling)
However, this averages the last 2 orders even if an order has not been delivered yet.
CodePudding user response:
This will get you the output you desire.
from io import StringIO
import pandas as pd  # 1.5.2

# Sample data from the question, comma-separated so it parses unambiguously
string = StringIO("""order time,delivered time,duration
2022-11-05 13:03:31,2022-11-05 13:33:31,30
2022-11-05 13:20:31,2022-11-05 13:54:31,34
2022-11-05 13:34:31,2022-11-05 13:59:31,25
2022-11-05 13:58:31,2022-11-05 14:30:31,32""")
df = pd.read_csv(string)
df["order time"] = pd.to_datetime(df["order time"])
df = df.set_index("order time")
df["rolling mean"] = df.rolling(
    window=2,
    min_periods=1
)["duration"].mean().shift(periods=2)
print(df)
                          delivered time  duration  rolling mean
order time
2022-11-05 13:03:31  2022-11-05 13:33:31        30           NaN
2022-11-05 13:20:31  2022-11-05 13:54:31        34           NaN
2022-11-05 13:34:31  2022-11-05 13:59:31        25          30.0
2022-11-05 13:58:31  2022-11-05 14:30:31        32          32.0
Explanation
You have the window set correctly, but if you leave min_periods at its default, it takes the same value as window.
By changing min_periods to 1, the mean is calculated from 2 values when the window is full; if only 1 value is available, the mean is just that value, e.g. for the first value in the series, 30/1 = 30.
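To see the effect of min_periods in isolation, here is a minimal sketch on the duration values alone. The names durations, default_mean and relaxed_mean are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Durations from the sample data (illustrative standalone Series)
durations = pd.Series([30, 34, 25, 32])

# Default: min_periods equals the window size, so the first result is NaN
default_mean = durations.rolling(window=2).mean()

# min_periods=1: a single available value is enough to produce a result
relaxed_mean = durations.rolling(window=2, min_periods=1).mean()

print(pd.DataFrame({"default": default_mean, "min_periods=1": relaxed_mean}))
```

The two columns differ only in the first row, where the window holds a single value.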
With min_periods=1 and a shift of 1, the result is the following.
df["rolling mean"] = df.rolling(
    window=2,
    min_periods=1
)["duration"].mean().shift(1)

                          delivered time  duration  rolling mean
order time
2022-11-05 13:03:31  2022-11-05 13:33:31        30           NaN
2022-11-05 13:20:31  2022-11-05 13:54:31        34          30.0
2022-11-05 13:34:31  2022-11-05 13:59:31        25          32.0
2022-11-05 13:58:31  2022-11-05 14:30:31        32          29.5
So we shift it by a period of 2 instead of 1 to get your desired output.
df["rolling mean"] = df.rolling(
    window=2,
    min_periods=1
)["duration"].mean().shift(periods=2)

                          delivered time  duration  rolling mean
order time
2022-11-05 13:03:31  2022-11-05 13:33:31        30           NaN
2022-11-05 13:20:31  2022-11-05 13:54:31        34           NaN
2022-11-05 13:34:31  2022-11-05 13:59:31        25          30.0
2022-11-05 13:58:31  2022-11-05 14:30:31        32          32.0
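Note that a fixed shift of 2 reproduces the sample output but never looks at the delivered time, so it may diverge on data where deliveries overlap differently. As an alternative, here is a rough sketch that compares timestamps explicitly. It assumes that "past 2 orders" means the two most recently placed orders that were already delivered when the new order was placed, and that a single delivered order is enough to produce a value (as in the third row of the desired output). The helper name delivered_rolling_mean is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "order time": pd.to_datetime(
            ["2022-11-05 13:03:31", "2022-11-05 13:20:31",
             "2022-11-05 13:34:31", "2022-11-05 13:58:31"]
        ),
        "delivered time": pd.to_datetime(
            ["2022-11-05 13:33:31", "2022-11-05 13:54:31",
             "2022-11-05 13:59:31", "2022-11-05 14:30:31"]
        ),
        "duration": [30, 34, 25, 32],
    }
)


def delivered_rolling_mean(frame, n=2):
    """Hypothetical helper: for each order, average the durations of the
    last n orders that were already delivered when it was placed."""
    frame = frame.sort_values("order time")
    means = []
    for placed_at in frame["order time"]:
        # Keep only orders whose delivery happened before this order was placed
        done = frame.loc[frame["delivered time"] <= placed_at, "duration"]
        # mean() of an empty selection is NaN, which covers the first rows
        means.append(done.tail(n).mean())
    return pd.Series(means, index=frame.index)


df["rolling mean"] = delivered_rolling_mean(df)
print(df)
```

This loop is quadratic in the number of rows, which is fine for small frames; for large data a merge-based approach would likely be preferable.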