Dataframe sort_values on multiple columns with string format dates-CodePudding

I have the following dataframe and I want to sort by two columns - first by id and those then need to be sorted in date order, both in ascending order, with the earliest date first:

df = pd.DataFrame(
    {
        "id": [9733, 9733, 9733, 9733, 9733, 9733, 9733, 9733, 2949, 2949, 2949, 2949, 2949, 2949, 2949, 9765, 9765, 9765, 9765, 9765, 9765],
        "date": ["01/05/2008", "01/06/1968", "01/11/2010", "01/12/2016", "09/03/2011", "09/05/1975", "28/04/2011", "29/01/2005", "08/07/1974", "09/03/2021", "10/03/2021", "18/09/1986", "20/07/2021", "23/09/2017", "26/06/2020", "01/04/1963", "01/08/2012", "02/08/2012", "25/06/2021", "30/11/2020", "31/03/1986",],
        "status": ["S", "A", "P", "C", "S", "D", "P", "P", "A", "P", "S", "D", "P", "P", "S", "A", "S", "P", "P", "S", "P"],
    }
)

First, I change the date column to datetime and apply the dd/mm/yyyy format:

df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.strftime('%d/%m/%Y')

Then, I tried using the sort_values function:

df.sort_values(by=['id','date'], ascending=[True,True], inplace=True)

I'm getting the id column sorted in ascending order. However, the date column isn't in ascending order and this is where I can't work out where I'm going wrong:

Ultimately, I want the dataframe sorted by the earliest P status only (I want A status and D status separately which I figure I use .filter to get that, each will be output to an Excel file in separate worksheets), but I'm stuck with the ordering of the id and dates.

Thanks for your help.

CodePudding user response：

If you convert it to a string, the date will be sorted based on alphabetic/lexicographic order. so your data is sorted correctly, given the fact that you specify a string with dd/mm/yyyy format.

If you don't want it sorted in that order, then sort the data first, then convert it to a string:

df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['id','date'], ascending=[True,True], inplace=True)
df['date'] = df['date'].dt.strftime('%d/%m/%Y')

Alternatively, to numerically sort the string-formatted dates, you can provide a custom key argument to pd.sort_values:

In [8]: df.sort_values(
   ...:     by=['id','date'],
   ...:     ascending=[True,True],
   ...:     inplace=True,
   ...:     key=lambda col: (
   ...:         pd.to_datetime(col, format="%d/%m/%Y")
   ...:         if col.name == "date"
   ...:         else col
   ...:     ),
   ...: )

In [9]: df
Out[9]:
      id        date status
8   2949  08/07/1974      A
11  2949  18/09/1986      D
13  2949  23/09/2017      P
14  2949  26/06/2020      S
9   2949  09/03/2021      P
10  2949  10/03/2021      S
12  2949  20/07/2021      P
1   9733  01/06/1968      A
5   9733  09/05/1975      D
7   9733  29/01/2005      P
0   9733  01/05/2008      S
2   9733  01/11/2010      P
4   9733  09/03/2011      S
6   9733  28/04/2011      P
3   9733  01/12/2016      C
15  9765  01/04/1963      A
20  9765  31/03/1986      P
16  9765  01/08/2012      S
17  9765  02/08/2012      P
19  9765  30/11/2020      S
18  9765  25/06/2021      P

CodePudding user response：

if I understand well you question, you need to sort only when the column status has value P:

#1 convert the column date to datetime
df['date'] = pd.to_datetime(df['date'])

#2 split the dataframe to 2 dataframes
df1 = df.loc[df['status'].eq('P')].sort_values(by=['id','date'], ascending=[True,True])
df2 = df.loc[df['status'].ne('P')]

df3 = pd.concat([df1, df2], axis=0, ignore_index=True) #this is your new df