I have a hierarchical time-series pandas DataFrame with multiple top-level entities. I want to combine and aggregate series that have fewer than 12 data points (i.e., the count of date is less than 12): combine them so the resulting series has more than 12 data points, and sum the values that occur on the same date.
Note that I don't want to combine series belonging to different level0 groups, as they are unrelated (e.g., A and Z are two distinct entities). In other words, I want to perform the combination and aggregation on level2 series within each level0 group. The original dataset after groupby looks like this:
grp = df.groupby(['level0','level1','level2','date']).sum()
# sum the values within the same date, because each level2 entity
# can have multiple records occurring on the same date
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300 # assume AA_1 has more than 12 data points so I don't want to modify it
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
AA_2 2006-04-30 600 # assume AA_2 has less than 12 data points
2006-05-31 2600
2007-09-30 6600
AB AB_1 2006-04-30 100 # assume AB_1 has less than 12 data points
2006-08-31 200
2007-06-30 300
2007-09-30 400
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680 # assume ZZ_9 has less than 12 data points
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
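For reference, a minimal raw frame reproducing this structure can be built like this (the layout in long format is an assumption; only the short series shown above are included):

```python
import pandas as pd

# Hypothetical raw data in long format, mimicking the structure above
# (only the short series AA_2, AB_1, and ZZ_9 are included)
df = pd.DataFrame({
    'level0': ['A', 'A', 'A', 'A', 'A', 'A', 'Z', 'Z'],
    'level1': ['AA', 'AA', 'AA', 'AB', 'AB', 'AB', 'ZZ', 'ZZ'],
    'level2': ['AA_2', 'AA_2', 'AA_2', 'AB_1', 'AB_1', 'AB_1', 'ZZ_9', 'ZZ_9'],
    'date': pd.to_datetime(['2006-04-30', '2006-05-31', '2007-09-30',
                            '2006-04-30', '2006-08-31', '2007-09-30',
                            '2006-04-30', '2006-09-30']),
    'values': [600, 2600, 6600, 100, 200, 400, 3680, 277],
})

# sum duplicate dates within each level2 series
grp = df.groupby(['level0', 'level1', 'level2', 'date']).sum()
```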
I assume that both AA_2 and AB_1, which belong to group A, have fewer than 12 data points, so I want to combine them. They share two dates, and for those two dates I want to sum the values. After creating the new aggregated group, I also want to drop the original ones.
Although ZZ_9 also has fewer than 12 data points, I won't combine it with the other two, because ZZ_9 belongs to group Z!
The desired output is like this:
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
agg_lv1 agg_lv2 2006-04-30 700 (=600+100) # assume we now have more than 12 data points
                                          # (rows omitted to keep the example short)
2006-05-31 2600
2006-08-31 200
... ...
2007-06-30 300
2007-09-30 7000 (=6600+400)
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
It's alright that the new aggregated levels carry the same names (i.e., agg_lv1 and agg_lv2) under every level0 entity because, as mentioned, level0 entities are unrelated and I want to keep the naming simple.
How can this be done?
CodePudding user response:
You can do this in multiple steps. First, partition the dataframe in two, where the first part contains all rows that need to be aggregated: series with fewer than 12 time points, restricted to level0 groups that contain more than one level1 group (if there is only one, there is nothing to combine with).
grp = grp.reset_index()
grp['nunique'] = grp.groupby(['level0'])['level1'].transform('nunique')
# partition
grp_small = (grp.loc[grp['nunique'] > 1]
                .groupby(['level0', 'level1', 'level2'])
                .filter(lambda x: len(x) < 12))
idx_small = grp_small.index
grp_large = grp.drop(index=idx_small)  # everything that stays untouched
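For context: GroupBy.filter returns the original rows, with their original index, from every group that satisfies the predicate, which is why the leftover rows can be recovered by index difference. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})

# keep only groups with fewer than 2 rows; the surviving row keeps index 2
short = toy.groupby('g').filter(lambda x: len(x) < 2)
rest = toy.drop(index=short.index)
```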
Now we can apply the sum
aggregation on the grp_small
dataframe while leaving grp_large
as it is.
# select 'values' so the string columns and the 'nunique' helper aren't summed
grp_small = grp_small.groupby(['level0', 'date'], as_index=False)['values'].sum()
grp_small[['level1', 'level2']] = ['agg_lv1', 'agg_lv2']
And finally, we concatenate the two dataframes and apply some final postprocessing:
df = pd.concat([grp_large, grp_small], ignore_index=True)
df = df.drop(columns='nunique').set_index(['level0', 'level1', 'level2', 'date']).sort_index()
Result with the given data (extra rows were added to the first group during the computation):
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
agg_lv1 agg_lv2 2006-04-30 700
2006-05-31 2600
2006-08-31 200
2007-06-30 300
2007-09-30 7000
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
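Putting it all together, here is a self-contained sketch of the whole answer on an invented miniature dataset. The 12-point threshold is lowered to 3 purely so the example stays short; everything else follows the steps above.

```python
import pandas as pd

# Invented miniature dataset; the "short series" threshold is lowered
# from 12 to 3 to keep the example small.
THRESHOLD = 3

df = pd.DataFrame({
    'level0': ['A'] * 7 + ['Z'] * 2,
    'level1': ['AA'] * 5 + ['AB'] * 2 + ['ZZ'] * 2,
    'level2': ['AA_1'] * 3 + ['AA_2'] * 2 + ['AB_1'] * 2 + ['ZZ_9'] * 2,
    'date': pd.to_datetime(['2006-10-31', '2006-11-30', '2006-12-31',
                            '2006-04-30', '2007-09-30',
                            '2006-04-30', '2007-09-30',
                            '2006-04-30', '2006-09-30']),
    'values': [300, 220, 415, 600, 6600, 100, 400, 3680, 277],
})

# sum duplicate dates within each series, then flatten for the partition
grp = df.groupby(['level0', 'level1', 'level2', 'date']).sum().reset_index()
grp['nunique'] = grp.groupby('level0')['level1'].transform('nunique')

# partition: short series in level0 groups that have something to combine with
grp_small = (grp.loc[grp['nunique'] > 1]
                .groupby(['level0', 'level1', 'level2'])
                .filter(lambda x: len(x) < THRESHOLD))
grp_large = grp.drop(index=grp_small.index)

# aggregate the short series per level0 and date, then relabel
grp_small = grp_small.groupby(['level0', 'date'], as_index=False)['values'].sum()
grp_small[['level1', 'level2']] = ['agg_lv1', 'agg_lv2']

out = (pd.concat([grp_large, grp_small], ignore_index=True)
         .drop(columns='nunique')
         .set_index(['level0', 'level1', 'level2', 'date'])
         .sort_index())
```

Here AA_2 and AB_1 (short series in group A) are merged into agg_lv1/agg_lv2, with 600+100=700 and 6600+400=7000 on the shared dates, while AA_1 and the lone ZZ_9 are left untouched.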