How to group a pandas DataFrame by month so that the output index holds the actual last dates in the data?

I have the following DataFrame and would like to group it by month.

import pandas as pd
import numpy as np

idx = pd.date_range(start='2001-01-01', end='2002-01-01', periods = 80)
df = pd.DataFrame(np.random.rand(160).reshape(80,2), index=idx.normalize(), columns=['a','b'])

With the following code I can group df by month, but the index labels of the result are the last days of each calendar month, not the last dates that actually occur in df for each month.

k = df.resample('M').apply(lambda x: x.iloc[-1])

k1 = df.groupby(pd.Grouper(freq='M')).last()

For example, df.loc['2001-01'].index[-1] is Timestamp('2001-01-28 00:00:00'), not Timestamp('2001-01-31 00:00:00'). However, both k and k1 are labelled with 2001-01-31, as shown below.

                   a         b
2001-01-31  0.521604  0.716046
2001-02-28  0.584479  0.560608
2001-03-31  0.201605  0.860491
2001-04-30  0.077426  0.711042
2001-05-31  0.544708  0.865880
2001-06-30  0.755516  0.863443
2001-07-31  0.266727  0.107859
2001-08-31  0.683754  0.098337
2001-09-30  0.586217  0.697163
2001-10-31  0.742394  0.160754
2001-11-30  0.655662  0.400128
2001-12-31  0.902192  0.580582
2002-01-31  0.878815  0.555669

In other words, I would like to group df by month and have the result indexed by the last date present in df for each month, not by the last date of each calendar month.

CodePudding user response：

Let us try duplicated after trimming the index to year-month strings, keeping only the last row for each month:

df = df.sort_index()
out = df[~df.index.strftime('%Y-%m').duplicated(keep='last')]
Out[242]: 
                   a         b
2001-01-28  0.984408  0.923390
2001-02-25  0.108587  0.797240
2001-03-29  0.058016  0.025948
2001-04-26  0.095034  0.226460
2001-05-28  0.386954  0.419999
2001-06-30  0.535202  0.576777
2001-07-27  0.389711  0.706282
2001-08-29  0.270434  0.342087
2001-09-30  0.190336  0.872519
2001-10-28  0.333673  0.832585
2001-11-29  0.651579  0.751776
2001-12-27  0.649476  0.748410
2002-01-01  0.670143  0.389339
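
For anyone who wants to check this on their own machine, here is a minimal self-contained sketch of the same idea. The fixed seed and the names rng, month_key and expected are my own assumptions for illustration and are not part of the answer above. The last lines cross-check that every kept label really is the latest date found in its month.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the fixed seed is an assumption
# added here so the run is repeatable.
rng = np.random.default_rng(0)
idx = pd.date_range(start='2001-01-01', end='2002-01-01', periods=80)
df = pd.DataFrame(rng.random((80, 2)), index=idx.normalize(), columns=['a', 'b'])

df = df.sort_index()

# One 'YYYY-MM' key per row; keep only the last row seen for each key.
month_key = df.index.strftime('%Y-%m')
out = df[~month_key.duplicated(keep='last')]

# Cross-check: the kept labels equal the latest date present in each month.
expected = df.index.to_series().groupby(df.index.to_period('M')).max()
assert list(out.index) == list(expected)
```

Sorting first matters because keep='last' relies on row order, not on the date values themselves.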

CodePudding user response：

@BENY's answer is great for this question. However, if you need to do something more with the group (e.g., calculate some aggregated statistics) then here is another idea that uses the groupby method:

df = df.reset_index()
df.groupby([(df["index"].dt.year),(df["index"].dt.month)]).last().set_index("index")
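
If the goal is an aggregate other than the last row, one possible variation (a sketch only, not something from the answers above) is to group on a monthly period key and then relabel the result with the last date actually present in each month. The names month, monthly_mean and last_date are hypothetical, and mean() is just a stand-in for whatever statistic is needed.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the fixed seed is an assumption.
rng = np.random.default_rng(0)
idx = pd.date_range(start='2001-01-01', end='2002-01-01', periods=80)
df = pd.DataFrame(rng.random((80, 2)), index=idx.normalize(), columns=['a', 'b'])

# Group key: the calendar month of each row, as a PeriodIndex.
month = df.index.to_period('M')

# Any aggregation can go here; mean() is only an example.
monthly_mean = df.groupby(month).mean()

# Latest date actually present in each month. It is grouped with the same
# key, so its rows line up with monthly_mean.
last_date = df.index.to_series().groupby(month).max()

# Replace the period labels with those real dates.
monthly_mean.index = pd.DatetimeIndex(last_date.to_numpy())
```

Both results are sorted by the same period key, which is why assigning the dates positionally is safe here.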