I have the following dataframe and I want to sort by two columns - first by id and those then need to be sorted in date order, both in ascending order, with the earliest date first:
df = pd.DataFrame(
{
"id": [9733, 9733, 9733, 9733, 9733, 9733, 9733, 9733, 2949, 2949, 2949, 2949, 2949, 2949, 2949, 9765, 9765, 9765, 9765, 9765, 9765],
"date": ["01/05/2008", "01/06/1968", "01/11/2010", "01/12/2016", "09/03/2011", "09/05/1975", "28/04/2011", "29/01/2005", "08/07/1974", "09/03/2021", "10/03/2021", "18/09/1986", "20/07/2021", "23/09/2017", "26/06/2020", "01/04/1963", "01/08/2012", "02/08/2012", "25/06/2021", "30/11/2020", "31/03/1986",],
"status": ["S", "A", "P", "C", "S", "D", "P", "P", "A", "P", "S", "D", "P", "P", "S", "A", "S", "P", "P", "S", "P"],
}
)
First, I change the date column to datetime and apply the dd/mm/yyyy format:
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.strftime('%d/%m/%Y')
Then, I tried using the sort_values function:
df.sort_values(by=['id','date'], ascending=[True,True], inplace=True)
I'm getting the id column sorted in ascending order. However, the date column isn't in ascending order and this is where I can't work out where I'm going wrong:
Ultimately, I want the dataframe sorted by the earliest P status only (I want A status and D status separately which I figure I use .filter to get that, each will be output to an Excel file in separate worksheets), but I'm stuck with the ordering of the id and dates.
Thanks for your help.
CodePudding user response:
If you convert it to a string, the date will be sorted based on alphabetic/lexicographic order. so your data is sorted correctly, given the fact that you specify a string with dd/mm/yyyy
format.
If you don't want it sorted in that order, then sort the data first, then convert it to a string:
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['id','date'], ascending=[True,True], inplace=True)
df['date'] = df['date'].dt.strftime('%d/%m/%Y')
Alternatively, to numerically sort the string-formatted dates, you can provide a custom key
argument to pd.sort_values
:
In [8]: df.sort_values(
...: by=['id','date'],
...: ascending=[True,True],
...: inplace=True,
...: key=lambda col: (
...: pd.to_datetime(col, format="%d/%m/%Y")
...: if col.name == "date"
...: else col
...: ),
...: )
In [9]: df
Out[9]:
id date status
8 2949 08/07/1974 A
11 2949 18/09/1986 D
13 2949 23/09/2017 P
14 2949 26/06/2020 S
9 2949 09/03/2021 P
10 2949 10/03/2021 S
12 2949 20/07/2021 P
1 9733 01/06/1968 A
5 9733 09/05/1975 D
7 9733 29/01/2005 P
0 9733 01/05/2008 S
2 9733 01/11/2010 P
4 9733 09/03/2011 S
6 9733 28/04/2011 P
3 9733 01/12/2016 C
15 9765 01/04/1963 A
20 9765 31/03/1986 P
16 9765 01/08/2012 S
17 9765 02/08/2012 P
19 9765 30/11/2020 S
18 9765 25/06/2021 P
CodePudding user response:
if I understand well you question, you need to sort only when the column status has value P:
#1 convert the column date to datetime
df['date'] = pd.to_datetime(df['date'])
#2 split the dataframe to 2 dataframes
df1 = df.loc[df['status'].eq('P')].sort_values(by=['id','date'], ascending=[True,True])
df2 = df.loc[df['status'].ne('P')]
df3 = pd.concat([df1, df2], axis=0, ignore_index=True) #this is your new df