I have the following DataFrame and would like to group it by month.
import pandas as pd
import numpy as np
idx = pd.date_range(start='2001-01-01', end='2002-01-01', periods = 80)
df = pd.DataFrame(np.random.rand(160).reshape(80,2), index=idx.normalize(), columns=['a','b'])
With the following code I can group df by month, but the resulting index labels are the last day of each calendar month, not the last date of each month that actually appears in df.
k = df.resample('M').apply(lambda x: x.iloc[-1])
k1 = df.groupby(pd.Grouper(freq='M')).last()
For example, df.loc['2001-01'].index[-1] is Timestamp('2001-01-28 00:00:00'), not Timestamp('2001-01-31 00:00:00'). However, k and k1 both include 2001-01-31, as shown below.
a b
2001-01-31 0.521604 0.716046
2001-02-28 0.584479 0.560608
2001-03-31 0.201605 0.860491
2001-04-30 0.077426 0.711042
2001-05-31 0.544708 0.865880
2001-06-30 0.755516 0.863443
2001-07-31 0.266727 0.107859
2001-08-31 0.683754 0.098337
2001-09-30 0.586217 0.697163
2001-10-31 0.742394 0.160754
2001-11-30 0.655662 0.400128
2001-12-31 0.902192 0.580582
2002-01-31 0.878815 0.555669
In other words, I would like to group df by month so that the result is labelled with the last date of each month that is present in df, not the last date of each calendar month.
CodePudding user response:
Let us try duplicated on the index after trimming it to year and month:
df = df.sort_index()
out = df[~df.index.strftime('%Y-%m').duplicated(keep='last')]
Out[242]:
a b
2001-01-28 0.984408 0.923390
2001-02-25 0.108587 0.797240
2001-03-29 0.058016 0.025948
2001-04-26 0.095034 0.226460
2001-05-28 0.386954 0.419999
2001-06-30 0.535202 0.576777
2001-07-27 0.389711 0.706282
2001-08-29 0.270434 0.342087
2001-09-30 0.190336 0.872519
2001-10-28 0.333673 0.832585
2001-11-29 0.651579 0.751776
2001-12-27 0.649476 0.748410
2002-01-01 0.670143 0.389339
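To make the mask easier to follow, here is a minimal sketch on a small hand-made frame. The dates, values, and the names df_demo, month_key, and out_demo are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical data: irregular dates, none on a calendar month end.
idx = pd.to_datetime(
    ["2001-01-05", "2001-01-28", "2001-02-10", "2001-02-25", "2001-03-29"]
)
df_demo = pd.DataFrame({"a": [1, 2, 3, 4, 5]}, index=idx).sort_index()

# One 'YYYY-MM' label per row; rows in the same month share a label.
month_key = df_demo.index.strftime("%Y-%m")

# duplicated(keep="last") flags every row except the last one per label,
# so negating it keeps only the last row of each month.
out_demo = df_demo[~month_key.duplicated(keep="last")]
print(out_demo)
```

The sort_index call matters: the mask keeps the last row by position within each month, which is only the latest date if the index is in ascending order.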
CodePudding user response:
@BENY's answer is great for this question. However, if you need to do something more with each group (e.g., calculate aggregated statistics), here is another idea that uses the groupby method:
df = df.reset_index()
df.groupby([(df["index"].dt.year),(df["index"].dt.month)]).last().set_index("index")
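As a rough sketch of the "aggregated statistics" case, the same idea can be combined with named aggregation so that each month is still labelled by its last observed date. The data and the names df_demo, tmp, month, a_mean, and a_last are made up for illustration, and grouping by a monthly period is used here in place of the separate year and month keys.

```python
import pandas as pd

# Hypothetical data: two observations in each of two months.
idx = pd.to_datetime(["2001-01-05", "2001-01-28", "2001-02-10", "2001-02-25"])
df_demo = pd.DataFrame({"a": [1.0, 3.0, 5.0, 9.0]}, index=idx).sort_index()

# Move the dates into a regular column so they can be aggregated too.
tmp = df_demo.rename_axis("date").reset_index()

# Group key: the month each date falls in (renamed to avoid clashing with "date").
month = tmp["date"].dt.to_period("M").rename("month")

stats = (
    tmp.groupby(month)
       .agg(date=("date", "max"), a_mean=("a", "mean"), a_last=("a", "last"))
       .set_index("date")  # label each month by its last observed date
)
print(stats)
```

Taking the max of the date column gives the last date present in each month regardless of row order, while a_last still depends on the frame being sorted.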