I'm working with a dataset that contains the sale prices of different objects per day. For any given date in the dataset I would like to compute the overall average price over a fixed number of days preceding a sale (not including the sale date itself), ideally in a vectorized manner, since the real dataset contains several million rows. The issue I run into with the conventional combination of groupby and rolling is that the aggregate values differ for rows that share the same date, as illustrated below.
Setup
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2021-11-01", "2021-12-01").to_list()
        + pd.date_range("2021-11-01", "2021-12-01").to_list(),
        "price": ([31, 59, 60, 180, 16, 23, 90, 42, 70] * 7)[:-1],
    }
).sort_values(by="date").reset_index(drop=True)
Gives
date price
0 2021-11-01 31
1 2021-11-01 16
2 2021-11-02 59
3 2021-11-02 23
4 2021-11-03 60
... ... ...
57 2021-11-29 59
58 2021-11-30 90
59 2021-11-30 60
60 2021-12-01 180
61 2021-12-01 42
What I've been trying is different variations along the lines of
df.set_index('date').rolling('3D', closed='right').mean()
Which gives incorrect output; note the difference between values for rows with the same date:
mean
date
2021-11-01 31.000000
2021-11-01 23.500000
2021-11-02 35.333333
2021-11-02 32.250000
2021-11-03 37.800000
...
2021-11-29 63.166667
2021-11-30 43.800000
2021-11-30 46.500000
2021-12-01 82.400000
2021-12-01 75.666667
The desired solution can be achieved with a non-vectorized approach by:
df.date.map(dict(
    (
        x.date(),
        df[(df.date >= x - pd.Timedelta(days=3)) & (df.date < x)].price.mean(),
    )
    for x in pd.date_range(df.date.min(), df.date.max())
))
mean
date
2021-11-01 NaN
2021-11-01 NaN
2021-11-02 23.500000
2021-11-02 23.500000
2021-11-03 32.250000
... ...
2021-11-29 66.500000
2021-11-30 63.166667
2021-11-30 63.166667
2021-12-01 46.500000
2021-12-01 46.500000
Any ideas on how to leverage vector operations?
CodePudding user response:
This will give you the result grouped by date. If you want it on the rows of the original df, you can merge it back into the original dataset (a sketch follows the output below).
df.set_index('date').groupby('date').mean().rolling('3D', closed='left').mean()
price
date
2021-11-01 NaN
2021-11-02 23.500000
2021-11-03 32.250000
2021-11-04 46.500000
2021-11-05 75.666667
2021-11-06 76.333333
2021-11-07 60.333333
2021-11-08 48.166667
2021-11-09 50.833333
2021-11-10 83.500000
2021-11-11 66.500000
2021-11-12 63.166667
2021-11-13 46.500000
2021-11-14 75.666667
2021-11-15 76.333333
2021-11-16 60.333333
2021-11-17 48.166667
2021-11-18 50.833333
2021-11-19 83.500000
2021-11-20 66.500000
2021-11-21 63.166667
2021-11-22 46.500000
2021-11-23 75.666667
2021-11-24 76.333333
2021-11-25 60.333333
2021-11-26 48.166667
2021-11-27 50.833333
2021-11-28 83.500000
2021-11-29 66.500000
2021-11-30 63.166667
2021-12-01 46.500000
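A minimal sketch of that merge back onto the original rows, assuming you want the value as a new column (the column name prev_3d_mean is just an illustrative choice):

# per-date rolling mean of daily means, exactly as above
daily = df.set_index('date').groupby('date').mean().rolling('3D', closed='left').mean()
daily = daily.rename(columns={'price': 'prev_3d_mean'})

# attach each date's value to every original row for that date
out = df.merge(daily, left_on='date', right_index=True, how='left')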
CodePudding user response:
2nd attempt - weighted averages

This is similar to your non-vectorized code, but performs slightly better on the small data set. It's not tested against real data, so it's unclear whether it will produce the results you want (i.e. a true weighted average rather than an average of averages). A vectorized variant of the same idea is sketched after the output below.
# calculate average price and number of occurrences for each date
dfn = df.groupby('date').agg(avgprice=('price', 'mean'), counts=('price', 'count'))
dfn['w_avg'] = 0.0
win = 3
end = dfn.shape[0] + 1

# calculate a weighted average over each window of `win` dates
for i in range(win, end, 1):
    dfr = dfn.iloc[i - win:i]
    wavg = (dfr['avgprice'] * dfr['counts']).sum() / dfr['counts'].sum()
    dfn.iat[i - win, 2] = wavg

# shift so each date holds the weighted average of the `win` preceding dates
dfn['w_avg'] = dfn['w_avg'].shift(win)
dfn
avgprice counts w_avg
date
2021-11-01 23.500 2 NaN
2021-11-02 41.000 2 NaN
2021-11-03 75.000 2 NaN
2021-11-04 111.000 2 46.500
2021-11-05 43.000 2 75.667
2021-11-06 27.000 2 76.333
2021-11-07 74.500 2 60.333
2021-11-08 51.000 2 48.167
2021-11-09 125.000 2 50.833
2021-11-10 23.500 2 83.500
2021-11-11 41.000 2 66.500
2021-11-12 75.000 2 63.167
2021-11-13 111.000 2 46.500
2021-11-14 43.000 2 75.667
2021-11-15 27.000 2 76.333
2021-11-16 74.500 2 60.333
2021-11-17 51.000 2 48.167
2021-11-18 125.000 2 50.833
2021-11-19 23.500 2 83.500
2021-11-20 41.000 2 66.500
2021-11-21 75.000 2 63.167
2021-11-22 111.000 2 46.500
2021-11-23 43.000 2 75.667
2021-11-24 27.000 2 76.333
2021-11-25 74.500 2 60.333
2021-11-26 51.000 2 48.167
2021-11-27 125.000 2 50.833
2021-11-28 23.500 2 83.500
2021-11-29 41.000 2 66.500
2021-11-30 75.000 2 63.167
2021-12-01 111.000 2 46.500
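For reference, here is a sketch of the same weighted average computed without the Python loop, by rolling the per-date sums and counts separately and then dividing. This is an untested assumption about your data (the column name prev3_mean is just illustrative), not the answer's original code:

# per-date price totals and row counts
daily = df.groupby('date')['price'].agg(['sum', 'count'])

# rolling sums over the 3 preceding days; closed='left' excludes the current date
rolled = daily.rolling('3D', closed='left').sum()

# total price divided by total count gives the true mean over the window,
# not an average of the daily averages
prev3_mean = rolled['sum'] / rolled['count']

# map the per-date value back onto every row of the original frame
df['prev3_mean'] = df['date'].map(prev3_mean)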