Home > Blockchain >  Python remove subgroups in a groupby object based on condition
Python remove subgroups in a groupby object based on condition

Time:02-11

I have a time series DataFrame involving multiple groups with a 3-level hierarchy (i.e., 3 id columns), together with a date column and a value column. The code I wrote to group them and an example of the result is shown below:

grp = df.groupby(['level0','level1','level2','date'])

                                        values          
level0  level1  level2      date        
A       AA      AA_1        2006-10-31  300
                            2006-11-30  220
                            2006-12-31  415
                            2007-04-30  19
                            2007-05-31  77
                            2007-08-31  463
                AA_2        2006-10-31  6630
                            2006-11-30  1980
                            2006-12-31  3367
                            2007-04-30  199
        AB      AB_1        2006-01-31  693
                            2006-05-31  2694
                            2007-09-30  6681 
...     ...     ...         ...         ...
Z       ZZ      ZZ_9        2006-04-30  3680
                            2006-09-30  277
                            2007-03-31  1490
                            2007-09-30  289
                            2007-10-31  387

I want to remove those in level2 who do not have any records in the last 6 months. Let's say Group A has the maximum date of 2007-12-31, then I want to remove AA_2 because it doesn't have any record in the last 6 months. The desired output will be like this:

                                        values
level0  level1  level2      date        
A       AA      AA_1        2006-10-31  300
                            2006-11-30  220
                            2006-12-31  415
                            2007-04-30  19
                            2007-05-31  77
                            2007-08-31  463
        AB      AB_1        2006-01-31  693
                            2006-05-31  2694
                            2007-09-30  6681 
...     ...     ...         ...         ...
Z       ZZ      ZZ_9        2006-04-30  3680
                            2006-09-30  277
                            2007-03-31  1490
                            2007-09-30  289
                            2007-10-31  387

I can extract the date range using the following code:

from dateutil.relativedelta import relativedelta
import pandas as pd

end_date = df.date.max()
start_date = end_date - relativedelta(months= 6 - 1)
test_period = pd.date_range(start=start_date, end=end_date, freq='1M').to_list()

[Timestamp('2007-07-31 00:00:00', freq='M'),
 Timestamp('2007-08-31 00:00:00', freq='M'),
 Timestamp('2007-09-30 00:00:00', freq='M'),
 Timestamp('2007-10-31 00:00:00', freq='M'),
 Timestamp('2007-11-30 00:00:00', freq='M'),
 Timestamp('2007-12-31 00:00:00', freq='M')]

However, as each level0 group has a different maximum date (e.g., some collect info up to 2007-12-31 while some 2007-11-30), the code above finds the maximum date of the whole dataset which isn't correct for some groups.

My question is how I can find the maximum date in each level0 group, and delete those having no record at all in the last 6 months?

Thank you in advance! (Any solution is welcome although a speedy one is most desired!)

CodePudding user response:

First create helper DataFrame by MultiIndex.to_frame with DataFrame.add_suffix for rename columns names and for first level get maximal values, subtract 6 months and compare if at least one value of column is greater in s, last test if per first 3 levels is at least one True in GroupBy.transform with GroupBy.any and filter in boolean indexing:

print (df)
                                 values
level0 level1 level2 date              
A      AA     AA_1   2007-12-31     300 <- date change
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
              AA_2   2006-10-31    6630
                     2006-11-30    1980
                     2006-12-31    3367
                     2007-04-30     199
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681

df1 = df.index.to_frame().add_suffix('_')


s = df1['date_'].gt(df1.groupby('level0')['date_']
                      .transform('max')
                      .sub(pd.offsets.DateOffset(months=6)))
print (s)
level0  level1  level2  date      
A       AA      AA_1    2007-12-31     True
                        2006-11-30    False
                        2006-12-31    False
                        2007-04-30    False
                        2007-05-31    False
                        2007-08-31     True
                AA_2    2006-10-31    False
                        2006-11-30    False
                        2006-12-31    False
                        2007-04-30    False
        AB      AB_1    2006-01-31    False
                        2006-05-31    False
                        2007-09-30     True
Name: date_, dtype: bool

df = df[s.groupby(['level0','level1','level2']).transform('any')]

print (df)
                                 values
level0 level1 level2 date              
A      AA     AA_1   2007-12-31     300
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681
  • Related