A given data set is ordered by [Dates
] and grouped by [Code
].
For each set of Value
's (grouped by Code
), I would like to sum the values 9 months in advance. For instance, row 0's new value would be the summation of all rows with the same code "123" that are 9 months in advance which is rows 0,1 & 2 so the sum of row 0's new value is 1 2 3. Here is a sample of the data for better clarity.
Dates Value Code Value(rolling 9Months)
0 01/31/2021 1 123 (1 2 3)
1 03/31/2021 2 123 (2 3 4)
2 09/30/2021 3 123 (3 4)
3 12/31/2021 4 123 (4)
4 02/28/2021 1 345 (1 2 5)
5 05/31/2021 2 345 (2 5)
6 09/30/2021 5 345 (5)
I figure the pseudo code would read something like this (apologies for the novitiate) --
While Code is constant
If (datei 1 – datei) <= 9
Then sum values with same code
Happy to elaborate on any point, thank you greatly for any help!
CodePudding user response:
Here's one idea: Basically, reindex each group into monthly intervals (fill the extra rows with 0s), compute the rolling sum with a window of 10 months, then merge the result with the original dataframe so only the desired rows remain.
import pandas as pd
from pandas import Timestamp
# setup
df = pd.DataFrame({'Dates': [Timestamp('2021-01-31 00:00:00'),
Timestamp('2021-03-31 00:00:00'),
Timestamp('2021-09-30 00:00:00'),
Timestamp('2021-12-31 00:00:00'),
Timestamp('2021-02-28 00:00:00'),
Timestamp('2021-05-31 00:00:00'),
Timestamp('2021-09-30 00:00:00')],
'Value': [1, 2, 3, 4, 1, 2, 5],
'Code': [123, 123, 123, 123, 345, 345, 345]})
def get_rolling_sum(frame):
# set the index to monthly values and fill NaNs with 0s
frame = frame.reindex(
pd.date_range(frame.index.min(), frame.index.max(), freq="M")
).fillna(0)
# compute the rolling sum of 10-month-long intervals on the reverse
return frame[::-1].rolling(window=10, min_periods=1).sum()
df_tmp = (df
.set_index("Dates") # set Dates as index for the reindexing
.groupby("Code")
.Value.apply(get_rolling_sum)
.reset_index() # turn the Dates and Code index into regular columns again
.rename(columns={"level_1": "Dates", "Value": "Value9M"})
)
Finally, merge the result with the original dataframe:
>>> df.merge(df_tmp, on=["Dates", "Code"])
Dates Value Code Value9M
0 2021-01-31 1 123 6.0
1 2021-03-31 2 123 9.0
2 2021-09-30 3 123 7.0
3 2021-12-31 4 123 4.0
4 2021-02-28 1 345 8.0
5 2021-05-31 2 345 7.0
6 2021-09-30 5 345 5.0
CodePudding user response:
N.B. The following will work only if each month per unique Code
value has only one associated date. E.g. it won't work, if you could have both '01/30/2021'
and '01/31/2021'
for 123
.
Code
import pandas as pd
data = {'Dates': {0: '01/31/2021', 1: '03/31/2021', 2: '09/30/2021',
3: '12/31/2021', 4: '02/28/2021', 5: '05/31/2021',
6: '09/30/2021'},
'Value': {0: 1, 1: 2, 2: 3, 3: 4, 4: 1, 5: 2, 6: 5},
'Code': {0: 123, 1: 123, 2: 123, 3: 123, 4: 345, 5: 345, 6: 345}}
df = pd.DataFrame(data)
# turn col `Dates` into datetime
df['Dates'] = pd.to_datetime(df['Dates'])
# set `Dates` as index
df.set_index('Dates', drop=True, inplace=True)
def roll(x):
# flip x (a `pd.Series` for col `Value`), apply rolling, and sum
# window == 279D (i.e. 9*31)
# min_periods == 1 (actually, the default)
return x[::-1].rolling('279D', min_periods=1).sum()
# first turn all dates into `EoM`, groupby `Code`, select col `Value`
# call func `roll` inside `transform` to ensure result respects
# order of orig index; copy only `to_numpy`, since index will not match,
# (but it will be the same order)
df['Value_rolling_9months'] = df.to_period('M').to_timestamp('M')\
.groupby('Code')['Value'].transform(roll).to_numpy()
print(df)
Value Code Value_rolling_9months
Dates
2021-01-31 1 123 6.0
2021-03-31 2 123 9.0
2021-09-30 3 123 7.0
2021-12-31 4 123 4.0
2021-02-28 1 345 8.0
2021-05-31 2 345 7.0
2021-09-30 5 345 5.0
If you 100% sure that your dates are always already EoM
(as in your example), then you could simply use:
df['Value_rolling_9months'] = df.groupby('Code')['Value'].transform(roll)
Simplified example
We are flipping the frame for each group in df.Code
. E.g.:
print(df.loc[df.Code == 345, 'Value'][::-1])
Dates
2021-09-30 5
2021-05-31 2
2021-02-28 1
Name: Value, dtype: int64
Now, we want to apply .rolling
with a window of "9 months"; unfortunately, if you try to roll on one or more months, you will receive an error. E.g.
df.loc[df.Code == 345, 'Value'][::-1].rolling('9M')
# ValueError: <9 * MonthEnds> is a non-fixed frequency
For the reason, see this SO post
. To circumvent this issue, we can roll by dates. Since we have only EoM
values, we can do this safely; there will always be a considerable gap between two consecutive dates.
To see what is actually happening with rolling
, we can list
the procedure. E.g.:
list(df.loc[df.Code == 345, 'Value'][::-1].rolling('279D', min_periods=1))
[Dates
2021-09-30 5
Name: Value, dtype: int64,
Dates
2021-09-30 5
2021-05-31 2
Name: Value, dtype: int64,
Dates
2021-09-30 5
2021-05-31 2
2021-02-28 1
Name: Value, dtype: int64]
For each of these windows
, we are asking the sum
. The parameter min_periods=1
means that we allow summing at a minimum of 1 observation
. E.g., suppose we would set it to 2
, then we would get a NaN value
returned for the first window.
So, here the sum will get us 5, 7, 8
. Note that these values would be in the incorrect order, we actually need 8, 7, 5
. This is where DataFrameGroupBy.transform
comes in: it ensures that the result is returned in the same order as the original index.