Home > Back-end >  Pandas: How to custom-sort on multiple columns?
Pandas: How to custom-sort on multiple columns?

Time:11-29

I have a pandas dataframe with data like:

 ----------- ----------------- --------- 
| JOB-NAME  |  Status         | SLA     |
 ----------- ----------------- --------- 
| job_1     | YET_TO_START    | ---     |
| job_3     | COMPLETED       | MET     |
| job_4     | RUNNING         | MET     |
| job_2     | YET_TO_START    | LATE    |
| job_6     | RUNNING         | LATE    |
| job_5     | FAILED          | LATE    |
| job_7     | YET_TO_START    | ---     |
| job_8     | COMPLETED       | NOT_MET |
 ----------- ----------------- --------- 

I need to sort this table based on the Status and SLA states, like for Status: FAILED will be top on the table, then YET_TO_START, then RUNNING, and finally COMPLETED. Similarly for SLA the order will be LATE, ---, NOT_MET, and MET. Like this:

 ----------- ----------------- --------- 
| JOB-NAME  |  Status         | SLA     |
 ----------- ----------------- --------- 
| job_5     | FAILED          | LATE    |
| job_2     | YET_TO_START    | LATE    |
| job_1     | YET_TO_START    | ---     |
| job_7     | YET_TO_START    | ---     |
| job_6     | RUNNING         | LATE    |
| job_4     | RUNNING         | MET     |
| job_8     | COMPLETED       | NOT_MET |
| job_3     | COMPLETED       | MET     |
 ----------- ----------------- --------- 

I am able to do this custom sorting priority-based only on single column Status, but unable to do for multiple columns.

sort_order_dict = {"FAILED":0, "YET_TO_START":1, "RUNNING":2, "COMPLETED":3}
joined_df = joined_df.sort_values(by=['status'], key=lambda x: x.map(sort_order_dict))

A solution is given here, but its for single column, not multiple column.

CodePudding user response:

You can extend dictionary by values from another columns, only necessary different keys in both columns for correct working like mentioned mozway in comments:

sort_order_dict = {"FAILED":0, "YET_TO_START":1, "RUNNING":2, "COMPLETED":3,
                   "LATE":4, "---":5, "NOT_MET":6, "MET":7}
df = df.sort_values(by=['Status','SLA'], key=lambda x: x.map(sort_order_dict))
print (df)
  JOB-NAME        Status      SLA
5    job_5        FAILED     LATE
3    job_2  YET_TO_START     LATE
0    job_1  YET_TO_START      ---
6    job_7  YET_TO_START      ---
4    job_6       RUNNING     LATE
2    job_4       RUNNING      MET
7    job_8     COMPLETED  NOT_MET
1    job_3     COMPLETED      MET

Or use ordered Categorical:

df['Status'] = pd.Categorical(df['Status'], ordered=True, 
                              categories=['FAILED', 'YET_TO_START', 'RUNNING', 'COMPLETED'])
                                          
df['SLA'] = pd.Categorical(df['SLA'], ordered=True, 
                              categories= ['LATE', '---', 'NOT_MET', 'MET'])
df = df.sort_values(by=['Status','SLA'])
print (df)
  JOB-NAME        Status      SLA
5    job_5        FAILED     LATE
3    job_2  YET_TO_START     LATE
0    job_1  YET_TO_START      ---
6    job_7  YET_TO_START      ---
4    job_6       RUNNING     LATE
2    job_4       RUNNING      MET
7    job_8     COMPLETED  NOT_MET
1    job_3     COMPLETED      MET

CodePudding user response:

Use numpy.lexsort, you can use any number of parameters easily:

import numpy as np

sort_order_dict = {"FAILED":0, "YET_TO_START":1, "RUNNING":2, "COMPLETED":3}
sort_order_dict2 = {'LATE': 0, '---': 1, 'NOT_MET': 2, 'MET': 3}

order = np.lexsort([df['SLA'].map(sort_order_dict2),
                    df['Status'].map(sort_order_dict),
                    ])

out = df.iloc[order]

Output:

  JOB-NAME        Status      SLA
5    job_5        FAILED     LATE
3    job_2  YET_TO_START     LATE
0    job_1  YET_TO_START      ---
6    job_7  YET_TO_START      ---
4    job_6       RUNNING     LATE
2    job_4       RUNNING      MET
7    job_8     COMPLETED  NOT_MET
1    job_3     COMPLETED      MET
  • Related