I have a time series DataFrame involving multiple groups with a 3-level hierarchy (i.e., 3 id columns), together with a date column and a value column. The code I wrote to group them and an example of the result is shown below:
grp = df.groupby(['level0','level1','level2','date'])
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
2007-04-30 19
2007-05-31 77
2007-08-31 463
AA_2 2006-10-31 6630
2006-11-30 1980
2006-12-31 3367
2007-04-30 199
AB AB_1 2006-01-31 693
2006-05-31 2694
2007-09-30 6681
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
I want to remove those in level2
who do not have any records in the last 6 months. Let's say Group A
has the maximum date of 2007-12-31, then I want to remove AA_2
because it doesn't have any record in the last 6 months. The desired output will be like this:
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
2007-04-30 19
2007-05-31 77
2007-08-31 463
AB AB_1 2006-01-31 693
2006-05-31 2694
2007-09-30 6681
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
I can extract the date range using the following code:
from dateutil.relativedelta import relativedelta
import pandas as pd
end_date = df.date.max()
start_date = end_date - relativedelta(months= 6 - 1)
test_period = pd.date_range(start=start_date, end=end_date, freq='1M').to_list()
[Timestamp('2007-07-31 00:00:00', freq='M'),
Timestamp('2007-08-31 00:00:00', freq='M'),
Timestamp('2007-09-30 00:00:00', freq='M'),
Timestamp('2007-10-31 00:00:00', freq='M'),
Timestamp('2007-11-30 00:00:00', freq='M'),
Timestamp('2007-12-31 00:00:00', freq='M')]
However, as each level0
group has a different maximum date (e.g., some collect info up to 2007-12-31 while some 2007-11-30), the code above finds the maximum date of the whole dataset which isn't correct for some groups.
My question is how I can find the maximum date in each level0
group, and delete those having no record at all in the last 6 months?
Thank you in advance! (Any solution is welcome although a speedy one is most desired!)
CodePudding user response:
First create helper DataFrame by MultiIndex.to_frame
with DataFrame.add_suffix
for rename columns names and for first level get maximal values, subtract 6 months and compare if at least one value of column is greater in s
, last test if per first 3 levels is at least one True
in GroupBy.transform
with GroupBy.any
and filter in boolean indexing
:
print (df)
values
level0 level1 level2 date
A AA AA_1 2007-12-31 300 <- date change
2006-11-30 220
2006-12-31 415
2007-04-30 19
2007-05-31 77
2007-08-31 463
AA_2 2006-10-31 6630
2006-11-30 1980
2006-12-31 3367
2007-04-30 199
AB AB_1 2006-01-31 693
2006-05-31 2694
2007-09-30 6681
df1 = df.index.to_frame().add_suffix('_')
s = df1['date_'].gt(df1.groupby('level0')['date_']
.transform('max')
.sub(pd.offsets.DateOffset(months=6)))
print (s)
level0 level1 level2 date
A AA AA_1 2007-12-31 True
2006-11-30 False
2006-12-31 False
2007-04-30 False
2007-05-31 False
2007-08-31 True
AA_2 2006-10-31 False
2006-11-30 False
2006-12-31 False
2007-04-30 False
AB AB_1 2006-01-31 False
2006-05-31 False
2007-09-30 True
Name: date_, dtype: bool
df = df[s.groupby(['level0','level1','level2']).transform('any')]
print (df)
values
level0 level1 level2 date
A AA AA_1 2007-12-31 300
2006-11-30 220
2006-12-31 415
2007-04-30 19
2007-05-31 77
2007-08-31 463
AB AB_1 2006-01-31 693
2006-05-31 2694
2007-09-30 6681