Pandas Rolling Sum

I have the following data:

            country  objectid  objectuse
record_date
2022-07-20    chile         0          4
2022-07-01    chile         1          4
2022-07-02    chile         1          4
2022-07-03    chile         1          4
2022-07-04    chile         1          4
...             ...       ...        ...
2022-07-26     peru      3088          4
2022-07-27     peru      3088          4
2022-07-28     peru      3088          4
2022-07-30     peru      3088          4
2022-07-31     peru      3088          4

The data describes the daily usage of an object within a country for a single month (July 2022), and not all objects are used every day.

I am interested in finding, per country, the sum of each object's maximum over the month, which is easy enough:

df.groupby(['country', 'objectid']).max().groupby(level=0).sum()

         objectuse
country
chile         1224
peru         17008

However, what I actually want is a rolling (cumulative) sum of the object maxima up to each day of the month within a country, for example:

            country       sum  
record_date
2022-07-01    chile         1
2022-07-01     peru         1
2022-07-02    chile         2
2022-07-02     peru         3
...             ...       ...
2022-07-31    chile       1224
2022-07-31     peru      17008

Is there a way to do this in pandas? Ultimately I also want to do this in SQL, but first I was trying to find an approach that works in pandas without the obvious route of looping and filtering the data for each period that I am interested in.

CodePudding user response：

You can use a cumulative sum for this after filtering by country. You can probably reduce the redundancy in my code here by defining a function.

import numpy as np
import pandas as pd

#create sample DataFrame
df=pd.DataFrame(['A','C','A','B','C','C','B','C'], columns=['country'])
df['sum1']=[4,7,9,5,8,2,5,9]

#filtering and using cumsum to add all
#nan values are used to skip those rows when using cumsum function
df['countryA']=[df.sum1[i] if df.country[i]=='A' else np.nan for i in range(len(df))]
df['countryA']=df.countryA.cumsum()
df['countryB']=[df.sum1[i] if df.country[i]=='B' else np.nan for i in range(len(df))]
df['countryB']=df.countryB.cumsum()
df['countryC']=[df.sum1[i] if df.country[i]=='C' else np.nan for i in range(len(df))]
df['countryC']=df.countryC.cumsum()

#replacing nan values to 0 to add all columns into 1
df.replace(np.nan,0,inplace=True)
df['sum_country']=df.countryA + df.countryB + df.countryC
df.drop(['countryA','countryB','countryC'],axis=1,inplace=True)

df
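
As a possible way to cut the per-country repetition, here is a minimal sketch that assumes the same sample data as above and uses `groupby(...).cumsum()` to keep a separate running total for each country. It only covers the per-country cumulative sum from this answer, not the per-object maxima from the question.

```python
import pandas as pd

# Same sample data as above (countries A, B, C are just placeholders)
df = pd.DataFrame({
    "country": ["A", "C", "A", "B", "C", "C", "B", "C"],
    "sum1": [4, 7, 9, 5, 8, 2, 5, 9],
})

# Running total of sum1 within each country, in row order
df["sum_country"] = df.groupby("country")["sum1"].cumsum()

print(df["sum_country"].tolist())  # [4, 7, 13, 5, 15, 17, 10, 26]
```

This should give the same values as the column-per-country version, with no helper columns to drop afterwards.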

CodePudding user response：

You could try the following:

res = (
    df
    .set_index("country", append=True)
    .sort_index()
    .pivot(columns="objectid", values="objectuse")
    .fillna(0).astype("int")
    .groupby(level=1).cummax()
    .sum(axis=1)
)

As an illustration I'm using the following sample dataframe:

from random import seed, randint, sample

import pandas as pd

seed(12345678)
objectids = list(range(10))
data = [
    [date, country, objectid, randint(1, 25)]
    for date in pd.date_range("2022-07-01", "2022-07-31", freq="1D")
    for country in ("chile", "peru")
    for objectid in sample(objectids, k=randint(0, 10))
]
df = (
    pd.DataFrame(data, columns=["record_date", "country", "objectid", "objectuse"])
    .set_index("record_date")
    .sample(frac=0.7, random_state=12345678)
    .sort_index()
)

            country  objectid  objectuse
record_date                             
2022-07-01     peru         1          5
2022-07-01     peru         6         12
2022-07-01    chile         9          8
2022-07-01     peru         8          4
2022-07-02     peru         0          8
...             ...       ...        ...
2022-07-31    chile         1         11
2022-07-31     peru         1          3
2022-07-31    chile         8         15
2022-07-31    chile         4          8
2022-07-31     peru         0         17

[210 rows x 3 columns]

Your first part

df.groupby(['country', 'objectid']).max().groupby(level=0).sum()

yields

         objectuse
country           
chile          231
peru           224

And the code above results in

record_date  country
2022-07-01   chile        8
             peru        21
2022-07-02   peru        44
2022-07-03   chile       68
             peru        46
2022-07-04   chile       83
             peru        87
2022-07-05   chile      148
             peru       118
2022-07-06   peru       128
...          ...        ...
2022-07-27   chile      228
             peru       221
2022-07-28   chile      228
             peru       221
2022-07-29   chile      231
2022-07-30   chile      231
             peru       221
2022-07-31   chile      231
             peru       224
dtype: int64
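
To make the idea easier to check by hand, here is a small sketch on a made-up eight-row frame (the values are invented for illustration, not taken from the question). It follows the same steps as the chain above: spread the objects into columns, take the running maximum per object within each country, then sum across objects. It uses `unstack` rather than `pivot` for the reshaping step, which should be equivalent as long as each (date, country, object) combination appears only once.

```python
import pandas as pd

# Made-up data: (record_date, country, objectid, objectuse)
rows = [
    ("2022-07-01", "chile", 0, 1),
    ("2022-07-01", "peru", 1, 1),
    ("2022-07-02", "chile", 0, 2),
    ("2022-07-02", "peru", 1, 1),
    ("2022-07-02", "peru", 2, 2),
    ("2022-07-03", "chile", 0, 1),
    ("2022-07-03", "chile", 3, 4),
    ("2022-07-03", "peru", 2, 1),
]
df = pd.DataFrame(rows, columns=["record_date", "country", "objectid", "objectuse"])
df["record_date"] = pd.to_datetime(df["record_date"])
df = df.set_index("record_date")

# One column per object; objects not used on a given day count as 0
wide = (
    df.set_index(["country", "objectid"], append=True)["objectuse"]
    .unstack("objectid")
    .fillna(0)
)

# Running maximum per object within each country, then sum across objects
running = wide.groupby(level="country").cummax().sum(axis=1)
print(running.tolist())  # [1.0, 1.0, 2.0, 3.0, 6.0, 3.0]

# The last value per country matches the month-level result from the question
monthly = df.groupby(["country", "objectid"])["objectuse"].max().groupby(level=0).sum()
print(monthly.to_dict())  # {'chile': 6, 'peru': 3}
```

On 2022-07-03 chile's object 0 drops back to 1, but its running maximum stays at 2, so the total is 2 + 4 = 6. A (date, country) pair with no rows at all is simply absent from the result, as in the output above.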