Rolling aggregates per date in pandas with duplicate date entries


I'm working with a dataset that contains the sale prices of different objects per day. For any given date in the dataset, I would like to compute the overall average price over a fixed number of days preceding the sale (not including the sale date itself), ideally in a vectorized manner, since the original dataset contains several million rows. The issue I run into with a conventional combination of groupby and rolling is that the aggregate values differ for rows with the same date, as illustrated below.

Setup

import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2021-11-01", "2021-12-01").to_list()
        + pd.date_range("2021-11-01", "2021-12-01").to_list(),
        "price": ([31, 59, 60, 180, 16, 23, 90, 42, 70] * 7)[:-1],
    }
).sort_values(by="date").reset_index(drop=True)

Gives

    date    price
0   2021-11-01  31
1   2021-11-01  16
2   2021-11-02  59
3   2021-11-02  23
4   2021-11-03  60
... ... ...
57  2021-11-29  59
58  2021-11-30  90
59  2021-11-30  60
60  2021-12-01  180
61  2021-12-01  42

What I've been trying is different variations along the lines of

df.set_index('date').rolling('3D', closed='right').mean()

This gives incorrect output; note the difference between values for the same date:

    mean
date    
2021-11-01    31.000000
2021-11-01    23.500000
2021-11-02    35.333333
2021-11-02    32.250000
2021-11-03    37.800000
                ...    
2021-11-29    63.166667
2021-11-30    43.800000
2021-11-30    46.500000
2021-12-01    82.400000
2021-12-01    75.666667

The desired solution can be achieved with a non-vectorized approach by:

df.date.map(dict(
    (
        x,
        df[(df.date >= x - pd.Timedelta(days=3)) & (df.date < x)].price.mean(),
    )
    for x in pd.date_range(df.date.min(), df.date.max())
))

    mean
date    
2021-11-01  NaN
2021-11-01  NaN
2021-11-02  23.500000
2021-11-02  23.500000
2021-11-03  32.250000
... ...
2021-11-29  66.500000
2021-11-30  63.166667
2021-11-30  63.166667
2021-12-01  46.500000
2021-12-01  46.500000
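To make the window concrete, the desired value for any date is the plain mean over all sales in the three preceding days, not a mean of daily means. A worked check for 2021-11-04, using the setup above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2021-11-01", "2021-12-01").to_list()
        + pd.date_range("2021-11-01", "2021-12-01").to_list(),
        "price": ([31, 59, 60, 180, 16, 23, 90, 42, 70] * 7)[:-1],
    }
).sort_values(by="date").reset_index(drop=True)

day = pd.Timestamp("2021-11-04")
# all six sales on 2021-11-01 through 2021-11-03, the sale date excluded
window = df[(df.date >= day - pd.Timedelta(days=3)) & (df.date < day)]
print(window.price.mean())  # 46.5
```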

Any ideas on how to leverage vector operations?

CodePudding user response:

This gives you the aggregates grouped by date. If you want the values on the rows of the original df, you can merge this result back into the original dataset:

df.set_index('date').groupby('date').mean().rolling('3D', closed='left').mean()

                price
date
2021-11-01        NaN
2021-11-02  23.500000
2021-11-03  32.250000
2021-11-04  46.500000
2021-11-05  75.666667
2021-11-06  76.333333
2021-11-07  60.333333
2021-11-08  48.166667
2021-11-09  50.833333
2021-11-10  83.500000
2021-11-11  66.500000
2021-11-12  63.166667
2021-11-13  46.500000
2021-11-14  75.666667
2021-11-15  76.333333
2021-11-16  60.333333
2021-11-17  48.166667
2021-11-18  50.833333
2021-11-19  83.500000
2021-11-20  66.500000
2021-11-21  63.166667
2021-11-22  46.500000
2021-11-23  75.666667
2021-11-24  76.333333
2021-11-25  60.333333
2021-11-26  48.166667
2021-11-27  50.833333
2021-11-28  83.500000
2021-11-29  66.500000
2021-11-30  63.166667
2021-12-01  46.500000
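The merge back onto the original rows mentioned above can be sketched like this (a sketch using the question's setup; the column name `rolling_mean` is my own choice):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2021-11-01", "2021-12-01").to_list()
        + pd.date_range("2021-11-01", "2021-12-01").to_list(),
        "price": ([31, 59, 60, 180, 16, 23, 90, 42, 70] * 7)[:-1],
    }
).sort_values(by="date").reset_index(drop=True)

# per-date mean, then a 3-day rolling window; closed='left' excludes the current date
daily = df.groupby("date")["price"].mean().rolling("3D", closed="left").mean()

# broadcast the per-date rolling mean back onto every original row
out = df.merge(daily.rename("rolling_mean"), left_on="date", right_index=True, how="left")
```

Every row of `out` now carries the same `rolling_mean` as any other row with the same date.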

CodePudding user response:

2nd attempt: weighted averages. This is similar to your non-vectorized code, but performs slightly better on the small data set. It's not tested against real data, so it's unclear whether it produces the results you want (i.e. true weighted averages, not averages of averages).

# calculate average price and number of occurrences for each date
dfn = df.groupby('date').agg(avgprice=('price','mean'), counts=('price', 'count'))

dfn['w_avg'] = 0.0

win = 3
end = dfn.shape[0] + 1

# calculate a weighted average
for i in range(win, end, 1):
    dfr = dfn.iloc[i-win:i]
    wavg = (dfr['avgprice'] * dfr['counts']).sum() / dfr['counts'].sum()
    dfn.iat[i-win, 2] = wavg
    
dfn['w_avg'] = dfn['w_avg'].shift(win)
dfn

            avgprice  counts  w_avg
date
2021-11-01    23.500       2    NaN
2021-11-02    41.000       2    NaN
2021-11-03    75.000       2    NaN
2021-11-04   111.000       2 46.500
2021-11-05    43.000       2 75.667
2021-11-06    27.000       2 76.333
2021-11-07    74.500       2 60.333
2021-11-08    51.000       2 48.167
2021-11-09   125.000       2 50.833
2021-11-10    23.500       2 83.500
2021-11-11    41.000       2 66.500
2021-11-12    75.000       2 63.167
2021-11-13   111.000       2 46.500
2021-11-14    43.000       2 75.667
2021-11-15    27.000       2 76.333
2021-11-16    74.500       2 60.333
2021-11-17    51.000       2 48.167
2021-11-18   125.000       2 50.833
2021-11-19    23.500       2 83.500
2021-11-20    41.000       2 66.500
2021-11-21    75.000       2 63.167
2021-11-22   111.000       2 46.500
2021-11-23    43.000       2 75.667
2021-11-24    27.000       2 76.333
2021-11-25    74.500       2 60.333
2021-11-26    51.000       2 48.167
2021-11-27   125.000       2 50.833
2021-11-28    23.500       2 83.500
2021-11-29    41.000       2 66.500
2021-11-30    75.000       2 63.167
2021-12-01   111.000       2 46.500
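The same weighted average can also be computed without the Python loop by rolling the per-date totals and counts separately and dividing. A sketch using the question's setup; note that unlike the loop above, partial windows at the start produce values instead of NaN, which matches the question's desired output:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2021-11-01", "2021-12-01").to_list()
        + pd.date_range("2021-11-01", "2021-12-01").to_list(),
        "price": ([31, 59, 60, 180, 16, 23, 90, 42, 70] * 7)[:-1],
    }
).sort_values(by="date").reset_index(drop=True)

# per-date totals and sale counts
daily = df.groupby("date")["price"].agg(total="sum", counts="count")

# rolling sums over the 3 days before each date (closed='left' excludes the date itself)
roll = daily.rolling("3D", closed="left").sum()

# true weighted average: total price in the window / number of sales in the window
daily["w_avg"] = roll["total"] / roll["counts"]
```

Because this divides summed prices by summed counts, dates with more sales weigh more heavily, i.e. it is not an average of averages.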