I have a hierarchical time-series pandas DataFrame with multiple top-level entities. I want to combine and aggregate series that have fewer than 12 data points (i.e., the count of date is less than 12): combine them so the resulting series has more than 12 data points, and sum the values that occur on the same date.
Note that I don't want to combine series belonging to different level0 groups, as they are unrelated (e.g., A and Z are two distinct entities). In other words, I want to perform the combination and aggregation on level2 series within each level0 group. The original dataset after groupby looks like this:
grp = df.groupby(['level0','level1','level2','date']).sum()
# sum the values within the same date, because each level2 entity
# can have multiple records occurring on the same date
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300 # assume AA_1 has more than 12 data points so I don't want to modify it
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
AA_2 2006-04-30 600 # assume AA_2 has less than 12 data points
2006-05-31 2600
2007-09-30 6600
AB AB_1 2006-04-30 100 # assume AB_1 has less than 12 data points
2006-08-31 200
2007-06-30 300
2007-09-30 400
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680 # assume ZZ_9 has less than 12 data points
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
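For reference, a minimal raw frame reproducing this structure can be built like this (the layout in long format is an assumption; only the short series shown above are included):

```python
import pandas as pd

# Hypothetical raw data in long format, mimicking the structure above
# (only the short series AA_2, AB_1, and ZZ_9 are included)
df = pd.DataFrame({
    'level0': ['A', 'A', 'A', 'A', 'A', 'A', 'Z', 'Z'],
    'level1': ['AA', 'AA', 'AA', 'AB', 'AB', 'AB', 'ZZ', 'ZZ'],
    'level2': ['AA_2', 'AA_2', 'AA_2', 'AB_1', 'AB_1', 'AB_1', 'ZZ_9', 'ZZ_9'],
    'date': pd.to_datetime(['2006-04-30', '2006-05-31', '2007-09-30',
                            '2006-04-30', '2006-08-31', '2007-09-30',
                            '2006-04-30', '2006-09-30']),
    'values': [600, 2600, 6600, 100, 200, 400, 3680, 277],
})

# sum duplicate dates within each level2 series
grp = df.groupby(['level0', 'level1', 'level2', 'date']).sum()
```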
I assume that both AA_2 and AB_1, which belong to group A, have fewer than 12 data points, so I want to combine them. They share two dates, and for those two dates I want to sum the values. After creating the new aggregated group, I also want to drop the original ones.
Although ZZ_9 also has fewer than 12 data points, I won't combine it with the other two, because ZZ_9 belongs to group Z!
The desired output is like this:
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
agg_lv1 agg_lv2 2006-04-30 700 (=600+100) # assume we now have more than 12 data points
                                          # (rows omitted to keep the example short)
2006-05-31 2600
2006-08-31 200
... ...
2007-06-30 300
2007-09-30 7000 (=6600+400)
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
It's alright that the new aggregated levels carry the same names (i.e., agg_lv1 and agg_lv2) under every level0 entity because, as mentioned, level0 entities are unrelated and I want to keep the naming simple.
How can this be done?
CodePudding user response:
You can do this in multiple steps. First, partition the dataframe in two, where the first part contains all rows that need to be aggregated: series with fewer than 12 time points, restricted to level0 groups that contain more than one level1 group (if there is only one, there is nothing to combine with).
grp = grp.reset_index()
grp['nunique'] = grp.groupby(['level0'])['level1'].transform('nunique')
# partition
grp_small = (grp.loc[grp['nunique'] > 1]
                .groupby(['level0', 'level1', 'level2'])
                .filter(lambda x: len(x) < 12))
idx_small = grp_small.index
grp_large = grp.drop(index=idx_small)  # everything that stays untouched
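For context: GroupBy.filter returns the original rows, with their original index, from every group that satisfies the predicate, which is why the leftover rows can be recovered by index difference. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})

# keep only groups with fewer than 2 rows; the surviving row keeps index 2
short = toy.groupby('g').filter(lambda x: len(x) < 2)
rest = toy.drop(index=short.index)
```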
Now we can apply the sum
aggregation on the grp_small
dataframe while leaving grp_large
as it is.
# select 'values' so the string columns and the 'nunique' helper aren't summed
grp_small = grp_small.groupby(['level0', 'date'], as_index=False)['values'].sum()
grp_small[['level1', 'level2']] = ['agg_lv1', 'agg_lv2']
And finally, we concatenate the two dataframes and apply some final postprocessing:
df = pd.concat([grp_large, grp_small], ignore_index=True)
df = df.drop(columns='nunique').set_index(['level0', 'level1', 'level2', 'date']).sort_index()
Result with the given data (extra rows were added to the first group during the computation):
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
agg_lv1 agg_lv2 2006-04-30 700
2006-05-31 2600
2006-08-31 200
2007-06-30 300
2007-09-30 7000
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
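Putting it all together, here is a self-contained sketch of the whole answer on an invented miniature dataset. The 12-point threshold is lowered to 3 purely so the example stays short; everything else follows the steps above.

```python
import pandas as pd

# Invented miniature dataset; the "short series" threshold is lowered
# from 12 to 3 to keep the example small.
THRESHOLD = 3

df = pd.DataFrame({
    'level0': ['A'] * 7 + ['Z'] * 2,
    'level1': ['AA'] * 5 + ['AB'] * 2 + ['ZZ'] * 2,
    'level2': ['AA_1'] * 3 + ['AA_2'] * 2 + ['AB_1'] * 2 + ['ZZ_9'] * 2,
    'date': pd.to_datetime(['2006-10-31', '2006-11-30', '2006-12-31',
                            '2006-04-30', '2007-09-30',
                            '2006-04-30', '2007-09-30',
                            '2006-04-30', '2006-09-30']),
    'values': [300, 220, 415, 600, 6600, 100, 400, 3680, 277],
})

# sum duplicate dates within each series, then flatten for the partition
grp = df.groupby(['level0', 'level1', 'level2', 'date']).sum().reset_index()
grp['nunique'] = grp.groupby('level0')['level1'].transform('nunique')

# partition: short series in level0 groups that have something to combine with
grp_small = (grp.loc[grp['nunique'] > 1]
                .groupby(['level0', 'level1', 'level2'])
                .filter(lambda x: len(x) < THRESHOLD))
grp_large = grp.drop(index=grp_small.index)

# aggregate the short series per level0 and date, then relabel
grp_small = grp_small.groupby(['level0', 'date'], as_index=False)['values'].sum()
grp_small[['level1', 'level2']] = ['agg_lv1', 'agg_lv2']

out = (pd.concat([grp_large, grp_small], ignore_index=True)
         .drop(columns='nunique')
         .set_index(['level0', 'level1', 'level2', 'date'])
         .sort_index())
```

Here AA_2 and AB_1 (short series in group A) are merged into agg_lv1/agg_lv2, with 600+100=700 and 6600+400=7000 on the shared dates, while AA_1 and the lone ZZ_9 are left untouched.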