conditional while loop to sum subsequent rows of data based on column values in Python


A given data set is ordered by [Dates] and grouped by [Code].

For each set of values (grouped by Code), I would like to sum the values up to 9 months in advance. For instance, row 0's new value would be the sum of all rows with the same code "123" whose dates fall within the following 9 months, which is rows 0, 1 & 2, so row 0's new value is 1+2+3. Here is a sample of the data for better clarity.

   Dates       Value  Code  Value(rolling 9 months)
0  01/31/2021    1     123  (1+2+3)
1  03/31/2021    2     123  (2+3+4)
2  09/30/2021    3     123  (3+4)
3  12/31/2021    4     123  (4)
4  02/28/2021    1     345  (1+2+5)
5  05/31/2021    2     345  (2+5)
6  09/30/2021    5     345  (5)

I figure the pseudo code would read something like this (apologies, I'm a novice) --

while code is constant:
    if (date_j - date_i) <= 9 months:
        sum the values with the same code

Happy to elaborate on any point, thank you greatly for any help!
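For reference, taking the pseudo code literally gives a brute-force double pass over the frame (a sketch, assuming the 9-month window is inclusive of each row's own date):

```python
import pandas as pd

df = pd.DataFrame({
    'Dates': pd.to_datetime(['01/31/2021', '03/31/2021', '09/30/2021',
                             '12/31/2021', '02/28/2021', '05/31/2021',
                             '09/30/2021']),
    'Value': [1, 2, 3, 4, 1, 2, 5],
    'Code': [123, 123, 123, 123, 345, 345, 345],
})

# for each row, sum the values of all rows with the same Code whose
# date falls at most 9 months after that row's date
sums = []
for _, row in df.iterrows():
    cutoff = row['Dates'] + pd.DateOffset(months=9)
    mask = ((df['Code'] == row['Code'])
            & (df['Dates'] >= row['Dates'])
            & (df['Dates'] <= cutoff))
    sums.append(df.loc[mask, 'Value'].sum())
df['Value9M'] = sums
print(df)
```

This is O(n²) and only meant to pin down the intended result, not to be efficient.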

CodePudding user response:

Here's one idea: Basically, reindex each group into monthly intervals (fill the extra rows with 0s), compute the rolling sum with a window of 10 months, then merge the result with the original dataframe so only the desired rows remain.

import pandas as pd
from pandas import Timestamp

# setup
df = pd.DataFrame({'Dates': [Timestamp('2021-01-31 00:00:00'),
                             Timestamp('2021-03-31 00:00:00'),
                             Timestamp('2021-09-30 00:00:00'),
                             Timestamp('2021-12-31 00:00:00'),
                             Timestamp('2021-02-28 00:00:00'),
                             Timestamp('2021-05-31 00:00:00'),
                             Timestamp('2021-09-30 00:00:00')],
                   'Value': [1, 2, 3, 4, 1, 2, 5],
                   'Code': [123, 123, 123, 123, 345, 345, 345]})
def get_rolling_sum(values):
    # reindex to one row per month end, filling the new rows with 0s
    values = values.reindex(
        pd.date_range(values.index.min(), values.index.max(), freq="M")
    ).fillna(0)

    # compute the rolling sum over 10 month-end rows (the current month
    # plus the 9 that follow) on the reversed series
    return values[::-1].rolling(window=10, min_periods=1).sum()

df_tmp = (df
          .set_index("Dates") # set Dates as index for the reindexing
          .groupby("Code")
          .Value.apply(get_rolling_sum)
          .reset_index() # turn the Dates and Code index into regular columns again
          .rename(columns={"level_1": "Dates", "Value": "Value9M"})
         )

Finally, merge the result with the original dataframe:

>>> df.merge(df_tmp, on=["Dates", "Code"])
       Dates  Value  Code  Value9M
0 2021-01-31      1   123      6.0
1 2021-03-31      2   123      9.0
2 2021-09-30      3   123      7.0
3 2021-12-31      4   123      4.0
4 2021-02-28      1   345      8.0
5 2021-05-31      2   345      7.0
6 2021-09-30      5   345      5.0
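To see why the window above is 10 rather than 9, here is what the reindexing step produces for code 345 (a minimal sketch using the same sample data; `pd.offsets.MonthEnd()` is a version-proof spelling of `freq="M"`):

```python
import pandas as pd

# the code-345 rows of the sample frame, with Dates as the index
s = pd.Series([1, 2, 5],
              index=pd.to_datetime(['2021-02-28', '2021-05-31', '2021-09-30']))

# reindex to every month end between the first and last date,
# filling the new rows with 0s
s = s.reindex(pd.date_range(s.index.min(), s.index.max(),
                            freq=pd.offsets.MonthEnd())).fillna(0)
print(s)
```

After reindexing, consecutive rows are exactly one month apart, so a fixed window of 10 rows covers the current month end plus the 9 that follow.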

CodePudding user response:

N.B. The following will work only if each month per unique Code value has at most one associated date. E.g., it won't work if both '01/30/2021' and '01/31/2021' could occur for code 123.

Code

import pandas as pd

data = {'Dates': {0: '01/31/2021', 1: '03/31/2021', 2: '09/30/2021',
                  3: '12/31/2021', 4: '02/28/2021', 5: '05/31/2021', 
                  6: '09/30/2021'},
        'Value': {0: 1, 1: 2, 2: 3, 3: 4, 4: 1, 5: 2, 6: 5},
        'Code': {0: 123, 1: 123, 2: 123, 3: 123, 4: 345, 5: 345, 6: 345}}

df = pd.DataFrame(data)

# turn col `Dates` into datetime
df['Dates'] = pd.to_datetime(df['Dates'])

# set `Dates` as index
df.set_index('Dates', drop=True, inplace=True)

def roll(x):
    # flip x (a `pd.Series` for col `Value`), apply rolling, and sum
    # window == 279D (i.e. 9*31)
    # min_periods == 1 (actually, the default)
    return x[::-1].rolling('279D', min_periods=1).sum()

# first snap all dates to end-of-month (`EoM`), group by `Code`, select col
# `Value`, and call `roll` inside `transform` so the result respects the
# order of the original index; copy only the values via `to_numpy`, since
# the normalized index will not match the original (but the order will)
df['Value_rolling_9months'] = df.to_period('M').to_timestamp('M')\
    .groupby('Code')['Value'].transform(roll).to_numpy()

print(df)

            Value  Code  Value_rolling_9months
Dates                                         
2021-01-31      1   123                    6.0
2021-03-31      2   123                    9.0
2021-09-30      3   123                    7.0
2021-12-31      4   123                    4.0
2021-02-28      1   345                    8.0
2021-05-31      2   345                    7.0
2021-09-30      5   345                    5.0

If you are 100% sure that your dates are always already EoM (as in your example), then you can simply use:

df['Value_rolling_9months'] = df.groupby('Code')['Value'].transform(roll)

Simplified example

We flip the Value series of each group in df.Code. E.g.:

print(df.loc[df.Code == 345, 'Value'][::-1])

Dates
2021-09-30    5
2021-05-31    2
2021-02-28    1
Name: Value, dtype: int64

Now, we want to apply .rolling with a window of "9 months"; unfortunately, if you try to use a month-based window, you will receive an error. E.g.:

df.loc[df.Code == 345, 'Value'][::-1].rolling('9M')
# ValueError: <9 * MonthEnds> is a non-fixed frequency

For the reason, see this SO post. To circumvent the issue, we can roll by days instead. Since we only have EoM values, this is safe: the span from an EoM date to the EoM date 9 months later is at most 275 days, i.e. within the 279-day window, while the span to the EoM date 10 months later is always at least 300 days, i.e. well outside it.

To see what is actually happening during the rolling, we can list the windows. E.g.:

list(df.loc[df.Code == 345, 'Value'][::-1].rolling('279D', min_periods=1))

[Dates
 2021-09-30    5
 Name: Value, dtype: int64,
 Dates
 2021-09-30    5
 2021-05-31    2
 Name: Value, dtype: int64,
 Dates
 2021-09-30    5
 2021-05-31    2
 2021-02-28    1
 Name: Value, dtype: int64]

For each of these windows, we take the sum. The parameter min_periods=1 means a window needs only a single observation to produce a value. E.g., if we set it to 2 instead, the first window would return NaN.
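A minimal example of the min_periods effect, using an ascending index for simplicity (the reversed case behaves the same way):

```python
import pandas as pd

s = pd.Series([1, 2, 5],
              index=pd.to_datetime(['2021-02-28', '2021-05-31', '2021-09-30']))

# with min_periods=2, the first window holds only one observation,
# so it produces NaN instead of a sum
print(s.rolling('279D', min_periods=2).sum())
```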

So, here the sum gives us 5, 7, 8. Note that these values are in the wrong order; we actually need 8, 7, 5. This is where DataFrameGroupBy.transform comes in: it ensures that the result is returned in the same order as the original index.
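A sketch of that re-ordering on the code-345 rows alone (same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({
    'Dates': pd.to_datetime(['2021-02-28', '2021-05-31', '2021-09-30']),
    'Value': [1, 2, 5],
    'Code': [345, 345, 345],
}).set_index('Dates')

def roll(x):
    return x[::-1].rolling('279D', min_periods=1).sum()

# `roll` returns its result in reversed (descending-date) order, but
# transform aligns it back onto the original ascending index
print(df.groupby('Code')['Value'].transform(roll))
```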
