Home > Net >  df.groupby(df.index.to_period('M')).last() generates a new date index exceeding the range
df.groupby(df.index.to_period('M')).last() generates a new date index exceeding the range

Time:11-26

Given a sample data as follows:

import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates, 
                  columns=['A_values', 'B_values', 'C_values', 'target'])
df

Out:

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694  1.389472  3.126731
2013-02-27  9.972433  1.281624  1.789931  7.529254
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-01  9.623960  6.165574  0.866300  5.612724
2013-03-02  6.165247  9.638430  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22  0.589729  6.479978  3.531450  6.872059
2013-05-23  6.279065  3.837670  8.853146  8.209883
2013-05-24  5.533017  5.241127  1.388056  5.355926
2013-05-25  1.596038  4.665995  2.406251  1.971875
2013-05-26  3.269001  1.787529  6.659690  7.545569

With code below, we can see the last row's index is outside of range of the initial date index (maximum date is 2013-05-26):

df.groupby(pd.Grouper(freq='M')).last()
Out[177]: 
            A_values  B_values  C_values    target
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-31  5.906967  8.545341  6.326550  8.684117
2013-04-30  5.358775  1.473809  5.231534  0.604810
2013-05-31  3.269001  1.787529  6.659690  7.545569

and:

df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())
Out[178]: 
2013-02   2013-02-28
2013-03   2013-03-31
2013-04   2013-04-30
2013-05   2013-05-26
Freq: M, dtype: datetime64[ns]

But I hope to get an expected result as follows, how could I do that? Thanks.

            A_values  B_values  C_values    target
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-31  5.906967  8.545341  6.326550  8.684117
2013-04-30  5.358775  1.473809  5.231534  0.604810
2013-05-26  3.269001  1.787529  6.659690  7.545569  # date should be `2013-05-26` based on the original data

CodePudding user response:

Idea is create helper column from DatetimeIndex and after last convert column to index:

df = (df.assign(new=df.index)
        .groupby(pd.Grouper(freq='M'))
        .last()
        .set_index('new')
        .rename_axis(None))
print (df)
            A_values  B_values  C_values    target
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-31  5.906967  8.545341  6.326550  8.684117
2013-04-30  5.358775  1.473809  5.231534  0.604810
2013-05-26  3.269001  1.787529  6.659690  7.545569
  • Related