Home > Software design >  Pandas rolling window cumsum
Pandas rolling window cumsum

Time:05-14

I have a pandas df as follows:

YEAR   MONTH   USERID    TRX_COUNT
2020   1        1         1
2020   2        1         2
2020   3        1         1
2020   12       1         1
2021   1        1         3
2021   2        1         3
2021   3        1         4

I want to sum the TRX_COUNT such that, each TRX_COUNT is the sum of TRX_COUNTS of the next 12 months. So my end result would look like

YEAR   MONTH   USERID    TRX_COUNT   TRX_COUNT_SUM
2020   1        1         1            5
2020   2        1         2            7
2020   3        1         1            8
2020   12       1         1            11
2021   1        1         3            10
2021   2        1         3            7
2021   3        1         4            4

For example TRX_COUNT_SUM for 2020/1 is 1 2 1 1=5 the count of the first 12 months.

Two areas I am confused how to proceed:

  1. I tried various variations of cumsum and grouping by USERID, YR, MONTH but am running into errors with handling the time window as there might be MONTHS where a user has no transactions and these have to be accounted for. For example in 2020/1 the user has no transactions for months 4-11, hence a full year of transaction count would be 5.

  2. Towards the end there will be partial years, which can be summed up and left as is (like 2021/3 which is left as 4).

Any thoughts on how to handle this? Thanks!

CodePudding user response:

I was able to accomplish this using a combination of numpy arrays, pandas, and indexing

import pandas as pd
import numpy as np

#df = your dataframe

df_dates = pd.DataFrame(np.arange(np.datetime64('2020-01-01'), np.datetime64('2021-04-01'), np.timedelta64(1, 'M'), dtype='datetime64[M]').astype('datetime64[D]'), columns = ['DATE'])
df_dates['YEAR'] = df_dates['DATE'].apply(lambda x : str(x).split('-')[0]).apply(lambda x : int(x))
df_dates['MONTH'] = df_dates['DATE'].apply(lambda x : str(x).split('-')[1]).apply(lambda x : int(x))

df_merge = df_dates.merge(df, how = 'left')
df_merge.replace(np.nan, 0, inplace=True)

df_merge.reset_index(inplace = True)

for i in range(0, len(df_merge)):
    max_index = df_merge['index'].max()
    if(i   11 < max_index):
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:i   12]['TRX_COUNT'].sum()
    elif(i != max_index):
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:max_index   1]['TRX_COUNT'].sum()
    else:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i]['TRX_COUNT']

final_df = pd.merge(df_merge, df)

CodePudding user response:

Try this:

# Set the Dataframe index to a time series constructed from YEAR and MONTH
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)

df["TRX_COUNT_SUM"] = (
    # Reindex the dataframe with every missing month in-between
    # Also reverse the index so that rolling(12) means 12 months
    # forward instead of backward
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    # Roll and sum
    .rolling(12, min_periods=1)
    ["TRX_COUNT"].sum()
)
  • Related