I have the following data:
country objectid objectuse
record_date
2022-07-20 chile 0 4
2022-07-01 chile 1 4
2022-07-02 chile 1 4
2022-07-03 chile 1 4
2022-07-04 chile 1 4
... ... ... ...
2022-07-26 peru 3088 4
2022-07-27 peru 3088 4
2022-07-28 peru 3088 4
2022-07-30 peru 3088 4
2022-07-31 peru 3088 4
The data describes the daily usage of an object within a country for a single month (July 2022); not all objects are used every day.
I am interested in finding the sum of the per-object monthly maxima for each country, which is easy enough:
df.groupby(['country', 'objectid']).max().groupby(level=0).sum()
objectuse
country
chile 1224
peru 17008
However, what I am actually after is a running sum of the object maxima up to each day of the month, within each country, for example:
country sum
record_date
2022-07-01 chile 1
2022-07-01 peru 1
2022-07-02 chile 2
2022-07-02 peru 3
... ... ...
2022-07-31 chile 1224
2022-07-31 peru 17008
Is there a way to do this in pandas? I ultimately also want to do this in SQL, but first I was trying to find an approach that works in pandas, without resorting to the obvious approach of looping and filtering the data to each period of interest.
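For reference, this is roughly the loop-and-filter baseline I would like to avoid (a sketch, assuming df is indexed by record_date as in the printout above):
import pandas as pd

# naive baseline: for each day, recompute the sum of per-object maxima
# over all rows up to and including that day
rows = []
for day in sorted(df.index.unique()):
    upto = df[df.index <= day]
    daily = (
        upto.groupby(['country', 'objectid'])['objectuse'].max()
            .groupby(level=0).sum()
    )
    for country, total in daily.items():
        rows.append((day, country, total))

result = (
    pd.DataFrame(rows, columns=['record_date', 'country', 'sum'])
      .set_index('record_date')
)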
CodePudding user response:
You can use a cumulative sum for this after filtering the countries out. You can probably reduce the redundancy in my code here by defining a function (see the sketch after the code).
#create sample DataFrame
import numpy as np
import pandas as pd

# create sample DataFrame
df = pd.DataFrame(['A', 'C', 'A', 'B', 'C', 'C', 'B', 'C'], columns=['country'])
df['sum1'] = [4, 7, 9, 5, 8, 2, 5, 9]

# filter each country into its own column and accumulate with cumsum;
# NaN values make cumsum skip the rows that belong to other countries
df['countryA'] = [df.sum1[i] if df.country[i] == 'A' else np.nan for i in range(len(df))]
df['countryA'] = df.countryA.cumsum()
df['countryB'] = [df.sum1[i] if df.country[i] == 'B' else np.nan for i in range(len(df))]
df['countryB'] = df.countryB.cumsum()
df['countryC'] = [df.sum1[i] if df.country[i] == 'C' else np.nan for i in range(len(df))]
df['countryC'] = df.countryC.cumsum()

# replace NaN with 0 so the three columns can be added into one
df.replace(np.nan, 0, inplace=True)
df['sum_country'] = df.countryA + df.countryB + df.countryC
df.drop(['countryA', 'countryB', 'countryC'], axis=1, inplace=True)
df
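As hinted above, the per-country helper columns can be collapsed into a single grouped cumulative sum; a minimal sketch of that refactor, using the same sample df:
# equivalent in one step: cumulative sum of sum1 within each country,
# aligned back to the original row order
df['sum_country'] = df.groupby('country')['sum1'].cumsum()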
CodePudding user response:
You could try the following:
res = (
    df
    .set_index("country", append=True)              # index: (record_date, country)
    .sort_index()
    .pivot(columns="objectid", values="objectuse")  # one column per objectid
    .fillna(0).astype("int")
    .groupby(level=1).cummax(axis=0)                # running max per country and object
    .sum(axis=1)                                    # sum the running maxima per row
)
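If you want the exact layout from the question (a country column plus a sum column, indexed by record_date), res can be reshaped like this (the column name "sum" is just taken from the question):
# move the country level out of the index and name the values "sum"
out = res.rename("sum").reset_index(level="country")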
As an illustration I'm using the following sample dataframe:
import pandas as pd
from random import seed, randint, sample

seed(12345678)
objectids = list(range(10))
# for each day and country, a random subset of objects with random usage
data = [
    [date, country, objectid, randint(1, 25)]
    for date in pd.date_range("2022-07-01", "2022-07-31", freq="1D")
    for country in ("chile", "peru")
    for objectid in sample(objectids, k=randint(0, 10))
]
df = (
    pd.DataFrame(data, columns=["record_date", "country", "objectid", "objectuse"])
    .set_index("record_date")
    .sample(frac=0.7, random_state=12345678)  # drop some rows at random
    .sort_index()
)
country objectid objectuse
record_date
2022-07-01 peru 1 5
2022-07-01 peru 6 12
2022-07-01 chile 9 8
2022-07-01 peru 8 4
2022-07-02 peru 0 8
... ... ... ...
2022-07-31 chile 1 11
2022-07-31 peru 1 3
2022-07-31 chile 8 15
2022-07-31 chile 4 8
2022-07-31 peru 0 17
[210 rows x 3 columns]
Your first part
df.groupby(['country', 'objectid']).max().groupby(level=0).sum()
yields
objectuse
country
chile 231
peru 224
And the code above results in
record_date country
2022-07-01 chile 8
peru 21
2022-07-02 peru 44
2022-07-03 chile 68
peru 46
2022-07-04 chile 83
peru 87
2022-07-05 chile 148
peru 118
2022-07-06 peru 128
... ... ...
2022-07-27 chile 228
peru 221
2022-07-28 chile 228
peru 221
2022-07-29 chile 231
2022-07-30 chile 231
peru 221
2022-07-31 chile 231
peru 224
dtype: int64
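As a sanity check, the last running total per country reproduces the monthly sums from the first part:
# last cumulative value per country equals the sum of monthly maxima
res.groupby(level="country").last()
# country
# chile    231
# peru     224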