I'm trying to calculate a rolling mean/std for a column in a dataframe. The pandas and numpy_ext rolling methods seem to require a fixed window size. My dataframe has a column "dates", and I want the window for each row to be determined by its "dates" value: for example, when computing the mean/std for rows at day 10, include rows from day 2 to day 6; for rows at day 11, include rows from day 3 to day 7; for rows at day 12, include rows from day 4 to day 8; and so on.
Is there a way to do this other than brute-force coding? Sample data below; "quantity" is the field I want the mean and std for.
dates | material | location | quantity |
---|---|---|---|
1 | C | A | 870 |
2 | C | A | 920 |
3 | C | A | 120 |
4 | C | A | 120 |
6 | C | A | 120 |
8 | D | A | 1200 |
8 | D | A | 720 |
10 | D | A | 480 |
11 | D | A | 600 |
12 | D | A | 720 |
13 | D | A | 80 |
13 | D | A | 600 |
14 | D | A | 1200 |
18 | E | B | 150 |
19 | E | B | 1416 |
20 | F | B | 1164 |
21 | G | B | 11520 |
For example, for each row I want the rolling mean of "quantity" over the window from 8 days before to 4 days before the current row's date (if any rows fall in that range); the expected output would be:
| dates | material | location | quantity | Mean |
|-------|----------|----------|----------|------|
| 1 | C | A | 870 | NaN |
| 2 | C | A | 920 | NaN |
| 3 | C | A | 120 | NaN |
| 4 | C | A | 120 | NaN |
| 6 | C | A | 120 | (870 + 920)/2 = 895 |
| 8 | D | A | 1200 | (870 + 920 + 120 + 120)/4 = 507.5 |
| 8 | D | A | 720 | (870 + 920 + 120 + 120)/4 = 507.5 |
| 10 | D | A | 480 | (920 + 120 + 120 + 120)/4 = 320 |
| 11 | D | A | 600 | (120 + 120 + 120)/3 = 120 |
| 12 | D | A | 720 | (120 + 120 + 1200 + 720)/4 = 540 |
| 13 | D | A | 80 | (120 + 1200 + 720)/3 = 680 |
| 13 | D | A | 600 | (120 + 1200 + 720)/3 = 680 |
| 14 | D | A | 1200 | (120 + 1200 + 720 + 480)/4 = 630 |
| 18 | E | B | 150 | (480 + 600 + 720 + 80 + 600 + 1200)/6 ≈ 613.3 |
Answer 1:
The DataFrame constructor for anyone else who wants to try:
import pandas as pd
import numpy as np

d = {'dates': [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
     'material': ['C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'E'],
     'location': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
     'quantity': [870, 920, 120, 120, 120, 1200, 720, 480, 600, 720, 80, 600, 1200, 150]}
df = pd.DataFrame(d)
df.rolling does accept a time-based window: "a time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes." So we would have to convert your days to something datetimelike (e.g. a pd.Timestamp or a pd.Timedelta) and set it as the index.
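For illustration only, a minimal sketch of that conversion (using the df and imports from above; the '5D' window span is just an example value):

# Sketch: treat the integer day numbers as a TimedeltaIndex so that a
# time-based (variable-size) window becomes possible.
tmp = df.set_index(pd.to_timedelta(df['dates'], unit='D'))
# A plain trailing 5-day window that ends at the current row:
print(tmp['quantity'].rolling('5D').mean())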
But this method won't be able to perform the shift that you want (e.g. for day 14 you don't want the window to end at day 14, but at day 10, i.e. 4 days before it).
So there is another option, which df.rolling also accepts: use a BaseIndexer subclass.
There is very little documentation on it and I'm not an expert, but I was able to hack my solution into it. Surely, there must be a better (proper) way to use all its attributes correctly, and hopefully someone will show it in their answer here.
How I did it:
Inside our BaseIndexer subclass, we have to define the get_window_bounds method, which returns a tuple of ndarrays: the index positions of the starts of all windows, and the index positions of the ends of all windows, respectively (positional indices like the ones used with iloc, not with loc).
To find them, I used the most efficient method from this answer: np.searchsorted. Your 'dates' must be sorted for this to work.
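As a quick illustration of the idea on the sample 'dates' (this is the same computation as in the class below, just with the constants written out):

import numpy as np
days = np.array([1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18])
starts = np.searchsorted(days, days - 8, side='left')   # first position with date >= d - 8
ends = np.searchsorted(days, days - 4, side='right')    # first position with date > d - 4
# e.g. for the row with date 10 (position 7): starts[7] == 1 and ends[7] == 5,
# so the window covers positions 1..4, i.e. dates 2, 3, 4 and 6.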
Any keyword arguments that we pass to the BaseIndexer subclass constructor will be set as its attributes. I will set day_from, day_to and days:
from pandas.api.indexers import BaseIndexer

class CustomWindow(BaseIndexer):
    """
    Indexer that selects the dates.

    It uses the arguments:
    ----------------------
    day_from : int
    day_to : int
    days : np.ndarray
    """
    def get_window_bounds(self,
                          num_values: int,
                          min_periods: int | None,
                          center: bool | None,
                          closed: str | None,
                          step: int | None = None) -> tuple[np.ndarray, np.ndarray]:
        """
        I'm not using these arguments, but they must be present because
        pandas passes them when it calls this method:
        `num_values` is the length of the df,
        `center`: False, `closed`: None
        (`step` is only passed by newer pandas versions, hence the default).
        """
        days = self.days
        # With `side` I'm making both ends inclusive:
        window_starts = np.searchsorted(days, days + self.day_from, side='left')
        window_ends = np.searchsorted(days, days + self.day_to, side='right')
        return (window_starts, window_ends)

# In my implementation both ends are inclusive:
day_from = -8
day_to = -4
days = df['dates'].to_numpy()

my_indexer = CustomWindow(day_from=day_from, day_to=day_to, days=days)
df[['mean', 'std']] = (df['quantity']
                       .rolling(my_indexer, min_periods=0)
                       .agg(['mean', 'std']))
Result:
dates material location quantity mean std
0 1 C A 870 NaN NaN
1 2 C A 920 NaN NaN
2 3 C A 120 NaN NaN
3 4 C A 120 NaN NaN
4 6 C A 120 895.000000 35.355339
5 8 D A 1200 507.500000 447.911822
6 8 D A 720 507.500000 447.911822
7 10 D A 480 320.000000 400.000000
8 11 D A 600 120.000000 0.000011
9 12 D A 720 540.000000 523.067873
10 13 D A 80 680.000000 541.109970
11 13 D A 600 680.000000 541.109970
12 14 D A 1200 630.000000 452.990066
13 18 E B 150 613.333333 362.803896
Answer 2:
You can perform your operation with a rolling; however, you have to pre- and post-process the DataFrame a bit to generate the shift:
A = 3
B = 8

s = (df
     # de-duplicate by getting the sum/count per identical date
     .groupby('dates')['quantity']
     .agg(['sum', 'count'])
     # reindex to fill missing dates
     .reindex(range(df['dates'].min(),
                    df['dates'].max() + 1),
              fill_value=0)
     # compute a classical fixed-size rolling sum of both columns
     .rolling(B - A, min_periods=1).sum()
     # compute the mean as rolling-sum / rolling-count
     .assign(mean=lambda d: d['sum'] / d['count'])
     # keep the mean and shift it so the window ends A+1 days earlier
     ['mean'].shift(A + 1)
     )

df['Mean'] = df['dates'].map(s)
output:
dates material location quantity Mean
0 1 C A 870 NaN
1 2 C A 920 NaN
2 3 C A 120 NaN
3 4 C A 120 NaN
4 6 C A 120 895.000000
5 8 D A 1200 507.500000
6 8 D A 720 507.500000
7 10 D A 480 320.000000
8 11 D A 600 120.000000
9 12 D A 720 540.000000
10 13 D A 80 680.000000
11 13 D A 600 680.000000
12 14 D A 1200 630.000000
13 18 E B 150 613.333333
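As a quick, hypothetical sanity check (assuming the BaseIndexer snippet from the first answer was also run on the same df, so that both a 'mean' and a 'Mean' column exist), the two approaches should agree:

import numpy as np
# NaNs are compared as equal so the first four rows don't fail the check:
print(np.allclose(df['mean'], df['Mean'], equal_nan=True))  # expected: True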