Home > Software engineering >  How to sort dataframe in month year order?
How to sort dataframe in month year order?

Time:10-13

I am new to python, and I am trying to solve a problem sorting a df in month year format, and my original data just looks like this: (re edit: Sorry, after checking the original df, the month column is actually like: Aug, Jul, Jul, Jun. Not number format)

ID       year   month      ym
1        2018    Aug    Aug 2018
2        2018    Jul    Jul 2018
3        2019    Jul    Jul 2019
4        2019    Jun    Jun 2018

The way I figure out is ①split into two df according to year, and then ②sort month, finally ③merge them. But there're some problems:

for ①, I don't know how to do the split by groupby;

for ②, I tried to sort like this, it succeeds, but it looks like it's just temporarily sorted:

sort_order=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df.index = pd.CategoricalIndex(df['month'], categories=sort_order, ordered=True)
df.sort_index().reset_index(drop=True)

for ③, I worry that because it's temporarily sorted, it will be the original one at the moment I try to merge they two.

I believe there should be better ways to solve this. Could anyone give a hint, or point out is there anything I got wrong? Thx!!

CodePudding user response:

according to your example (assuming your dataframe named df) just do the following:

df = df.sort_values(["year", "month"])

and this is the result :

    year    month   ym
1   2018    7   Jul 2018
0   2018    8   Aug 2018
3   2019    6   Jun 2018
2   2019    7   Jul 2019

CodePudding user response:

You can make a composite string with YYYY-mm, then sort according to this sequence.

We first convert your column ym in MMM YYYY string format to datetime format by pd.to_datetime then, use dt.strftime to format the date string in YYYY-mm. This format string with year at the beginning followed by month is good for sorting in chronological order.

df['YYYY-mm'] = pd.to_datetime(df['ym'], format='%b %Y').dt.strftime('%Y-%m')

df = df.sort_values('YYYY-mm')

Result:

print(df)


   ID  year month        ym  YYYY-mm
3   4  2019   Jun  Jun 2018  2018-06
1   2  2018   Jul  Jul 2018  2018-07
0   1  2018   Aug  Aug 2018  2018-08
2   3  2019   Jul  Jul 2019  2019-07

CodePudding user response:

If you want to work with date, I suggest you to work with a real DatetimeIndex

df = df.set_index(pd.to_datetime(df['ym']).rename('datetime'))
print(df)

# Output:
            ID  year  month        ym
datetime                             
2018-08-01   1  2018      8  Aug 2018
2018-07-01   2  2018      7  Jul 2018
2019-07-01   3  2019      7  Jul 2019
2018-06-01   4  2019      6  Jun 2018

Now you can easily sort your dataframe

>>> df.sort_index(ascending=False)
            ID  year  month        ym
datetime                             
2019-07-01   3  2019      7  Jul 2019
2018-08-01   1  2018      8  Aug 2018
2018-07-01   2  2018      7  Jul 2018
2018-06-01   4  2019      6  Jun 2018

Filter your dataframe:

>>> df[df.index > "2018-06"]
            ID  year  month        ym
datetime                             
2018-08-01   1  2018      8  Aug 2018
2018-07-01   2  2018      7  Jul 2018
2019-07-01   3  2019      7  Jul 2019

Group by year:

>>>  df.groupby(df.index.year)['ID'].sum()
datetime
2018    7
2019    3
Name: ID, dtype: int64
  • Related