pandas dataframe calculate rolling mean using customized window size


I'm trying to calculate a rolling mean/std for a column in a DataFrame. The pandas and numpy_ext rolling methods seem to need a fixed window size. The DataFrame has a column "dates", and I want to determine the window based on the "dates": for example, when calculating the mean/std, rows at day 10 should include rows from day 2 to day 6, rows at day 11 should include rows from day 3 to day 7, rows at day 12 should include rows from day 4 to day 8, etc.

I want to know if there is a way to do this other than brute-force coding. Sample data follows; "quantity" is the target field for the mean and std.

dates material location quantity
1 C A 870
2 C A 920
3 C A 120
4 C A 120
6 C A 120
8 D A 1200
8 D A 720
10 D A 480
11 D A 600
12 D A 720
13 D A 80
13 D A 600
14 D A 1200
18 E B 150
19 E B 1416
20 F B 1164
21 G B 11520

For example, for each row I want the rolling mean of "quantity" over the previous 4-8 days (if any); the expected output would be:

| dates | material | location | quantity | Mean                                 |
|-------|----------|----------|----------|--------------------------------------|
| 1     | C        | A        | 870      | NaN                                  |
| 2     | C        | A        | 920      | NaN                                  |
| 3     | C        | A        | 120      | NaN                                  |
| 4     | C        | A        | 120      | NaN                                  |
| 6     | C        | A        | 120      | (870+920)/2 = 895                    |
| 8     | D        | A        | 1200     | (870+920+120+120)/4 = 507.5          |
| 8     | D        | A        | 720      | (870+920+120+120)/4 = 507.5          |
| 10    | D        | A        | 480      | (920+120+120+120)/4 = 320            |
| 11    | D        | A        | 600      | (120+120+120)/3 = 120                |
| 12    | D        | A        | 720      | (120+120+1200+720)/4 = 540           |
| 13    | D        | A        | 80       | (120+1200+720)/3 = 680               |
| 13    | D        | A        | 600      | (120+1200+720)/3 = 680               |
| 14    | D        | A        | 1200     | (120+1200+720+480)/4 = 630           |
| 18    | E        | B        | 150      | (480+600+720+80+600+1200)/6 ≈ 613.33 |

CodePudding user response:

The DataFrame constructor for anyone else to try:

import pandas as pd

d = {'dates': [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
     'material': ['C','C','C','C','C','D','D','D','D','D','D','D','D','E'],
     'location': ['A','A','A','A','A','A','A','A','A','A','A','A','A','B'],
     'quantity': [870, 920, 120, 120, 120, 1200, 720, 480, 600, 720, 80, 600, 1200, 150]}
df = pd.DataFrame(d)

df.rolling does accept "a time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes."

So we would have to convert your days to datetimelike (e.g., a pd.Timestamp, or a pd.Timedelta), and set it as index.

But this method won't be able to perform the shift that you want (e.g., for day 14 you want the window to end not at day 14 but at day 10, 4 days before it).
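
For illustration only, here is a minimal sketch of that datetimelike approach (the origin date 2000-01-01 is an arbitrary choice of mine, used only to turn the integer day numbers into timestamps); it can only produce a trailing window that ends at the current row, which is exactly the limitation described above:

import pandas as pd

# Map the integer day numbers onto an arbitrary datetime origin so that
# rolling can use an offset-based window.
tmp = df.set_index(pd.Timestamp('2000-01-01') + pd.to_timedelta(df['dates'], unit='D'))

# A 5-day window that ends at (and includes) the current row -- there is no
# way to shift it back by 4 days from here.
trailing_mean = tmp['quantity'].rolling('5D').mean()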


So there is another option, which df.rolling also accepts:

Use a BaseIndexer subclass

There is very little documentation on it and I'm not an expert, but I was able to hack my solution into it. Surely, there must be a better (proper) way to use all its attributes correctly, and hopefully someone will show it in their answer here.

How I did it:

Inside our BaseIndexer subclass, we have to define the get_window_bounds method, which returns a tuple of ndarrays: the index positions of the starts of all windows and those of the ends of all windows, respectively (positional indices, the kind used with iloc, not with loc).

To find them, I used the most efficient method from this answer: np.searchsorted.

Your 'dates' must be sorted for this.
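
As a quick standalone illustration of how np.searchsorted produces those bounds (a hypothetical mini-example, not part of the final code):

import numpy as np

days = np.array([1, 2, 3, 4, 6, 8, 8, 10])

# Start bound: first position whose day is >= day - 8 (inclusive left edge).
starts = np.searchsorted(days, days - 8, side='left')

# End bound: one past the last position whose day is <= day - 4
# (rolling treats the end position as exclusive, so day - 4 stays inclusive).
ends = np.searchsorted(days, days - 4, side='right')

# For the row at day 10 this gives the half-open range [1, 5),
# i.e. the rows at days 2, 3, 4 and 6.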

Any keyword arguments that we pass to the BaseIndexer subclass constructor will be set as its attributes. I will set day_from, day_to and days:

import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomWindow(BaseIndexer):
    """
    Indexer that selects the dates.

    It uses the arguments:
    ----------------------
    day_from : int
    day_to : int
    days : np.ndarray
    """
    def get_window_bounds(self,
                          num_values: int,
                          min_periods: int | None,
                          center: bool | None,
                          closed: str | None,
                          step: int | None = None,  # some newer pandas versions also pass `step`
                          ) -> tuple[np.ndarray, np.ndarray]:
        """
        I'm not using these arguments, but they must be present because
        rolling passes them when it calls this method:
            `num_values` is the length of the df,
            `center`: False, `closed`: None.
        """
        days = self.days
        # With `side` I'm making both ends inclusive:
        window_starts = np.searchsorted(days, days + self.day_from, side='left')
        window_ends = np.searchsorted(days, days + self.day_to, side='right')
        return (window_starts, window_ends)

# In my implementation both ends are inclusive:
day_from = -8
day_to = -4
days = df['dates'].to_numpy()

my_indexer = CustomWindow(day_from=day_from, day_to=day_to, days=days)
df[['mean', 'std']] = (df['quantity']
                       .rolling(my_indexer, min_periods=0)
                       .agg(['mean', 'std']))

Result:

    dates material location  quantity        mean         std
0       1        C        A       870         NaN         NaN
1       2        C        A       920         NaN         NaN
2       3        C        A       120         NaN         NaN
3       4        C        A       120         NaN         NaN
4       6        C        A       120  895.000000   35.355339
5       8        D        A      1200  507.500000  447.911822
6       8        D        A       720  507.500000  447.911822
7      10        D        A       480  320.000000  400.000000
8      11        D        A       600  120.000000    0.000011
9      12        D        A       720  540.000000  523.067873
10     13        D        A        80  680.000000  541.109970
11     13        D        A       600  680.000000  541.109970
12     14        D        A      1200  630.000000  452.990066
13     18        E        B       150  613.333333  362.803896

CodePudding user response:

You can perform your operation with a rolling computation; however, you have to pre- and post-process the DataFrame a bit to generate the shift:

A = 3
B = 8
s = (df
     # de-duplicate by getting the sum/count per identical date
     .groupby('dates')['quantity']
     .agg(['sum', 'count'])
     # reindex to fill missing dates
     .reindex(range(df['dates'].min(),
                    df['dates'].max() + 1),
              fill_value=0)
     # classical fixed-size rolling over B-A = 5 consecutive days
     .rolling(B - A, min_periods=1).sum()
     # compute the mean from the rolling sum and count
     .assign(mean=lambda d: d['sum'] / d['count'])
     # shift forward so each date reads the window that ended A+1 = 4 days earlier
     ['mean'].shift(A + 1)
     )

df['Mean'] = df['dates'].map(s)
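
To see why the shift lines up, take day 12 as a check: the 5-day rolling sum computed at day 8 covers days 4-8, and shift(A + 1) moves that value forward by 4 days so it lands on day 12, i.e. the 4-to-8-days-back window. A quick assertion, purely illustrative, reusing df and s from above:

# Day 12 should see the quantities from days 4, 6, 8 and 8.
expected = (120 + 120 + 1200 + 720) / 4   # = 540.0
assert abs(s.loc[12] - expected) < 1e-9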

output:

    dates material location  quantity        Mean
0       1        C        A       870         NaN
1       2        C        A       920         NaN
2       3        C        A       120         NaN
3       4        C        A       120         NaN
4       6        C        A       120  895.000000
5       8        D        A      1200  507.500000
6       8        D        A       720  507.500000
7      10        D        A       480  320.000000
8      11        D        A       600  120.000000
9      12        D        A       720  540.000000
10     13        D        A        80  680.000000
11     13        D        A       600  680.000000
12     14        D        A      1200  630.000000
13     18        E        B       150  613.333333