I am new to python, and I am trying to solve a problem sorting a df in month year format, and my original data just looks like this: (re edit: Sorry, after checking the original df, the month column is actually like: Aug, Jul, Jul, Jun. Not number format)
ID year month ym
1 2018 Aug Aug 2018
2 2018 Jul Jul 2018
3 2019 Jul Jul 2019
4 2019 Jun Jun 2018
The way I figure out is ①split into two df according to year, and then ②sort month, finally ③merge them. But there're some problems:
for ①, I don't know how to do the split by groupby;
for ②, I tried to sort like this, it succeeds, but it looks like it's just temporarily sorted:
sort_order=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df.index = pd.CategoricalIndex(df['month'], categories=sort_order, ordered=True)
df.sort_index().reset_index(drop=True)
for ③, I worry that because it's temporarily sorted, it will be the original one at the moment I try to merge they two.
I believe there should be better ways to solve this. Could anyone give a hint, or point out is there anything I got wrong? Thx!!
CodePudding user response:
according to your example (assuming your dataframe named df) just do the following:
df = df.sort_values(["year", "month"])
and this is the result :
year month ym
1 2018 7 Jul 2018
0 2018 8 Aug 2018
3 2019 6 Jun 2018
2 2019 7 Jul 2019
CodePudding user response:
You can make a composite string with YYYY-mm
, then sort according to this sequence.
We first convert your column ym
in MMM YYYY
string format to datetime format by pd.to_datetime
then, use dt.strftime
to format the date string in YYYY-mm
. This format string with year at the beginning followed by month is good for sorting in chronological order.
df['YYYY-mm'] = pd.to_datetime(df['ym'], format='%b %Y').dt.strftime('%Y-%m')
df = df.sort_values('YYYY-mm')
Result:
print(df)
ID year month ym YYYY-mm
3 4 2019 Jun Jun 2018 2018-06
1 2 2018 Jul Jul 2018 2018-07
0 1 2018 Aug Aug 2018 2018-08
2 3 2019 Jul Jul 2019 2019-07
CodePudding user response:
If you want to work with date, I suggest you to work with a real DatetimeIndex
df = df.set_index(pd.to_datetime(df['ym']).rename('datetime'))
print(df)
# Output:
ID year month ym
datetime
2018-08-01 1 2018 8 Aug 2018
2018-07-01 2 2018 7 Jul 2018
2019-07-01 3 2019 7 Jul 2019
2018-06-01 4 2019 6 Jun 2018
Now you can easily sort your dataframe
>>> df.sort_index(ascending=False)
ID year month ym
datetime
2019-07-01 3 2019 7 Jul 2019
2018-08-01 1 2018 8 Aug 2018
2018-07-01 2 2018 7 Jul 2018
2018-06-01 4 2019 6 Jun 2018
Filter your dataframe:
>>> df[df.index > "2018-06"]
ID year month ym
datetime
2018-08-01 1 2018 8 Aug 2018
2018-07-01 2 2018 7 Jul 2018
2019-07-01 3 2019 7 Jul 2019
Group by year:
>>> df.groupby(df.index.year)['ID'].sum()
datetime
2018 7
2019 3
Name: ID, dtype: int64