Ordering a dataframe with a key function and multiple columns-CodePudding

I have the following

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})

print(df)

print(df.sort_values(by=['col5']))

print(df.sort_values(by=['col2']))

print(df.sort_values(by='col2', key=lambda col: col.abs() ))

So far so good.

However I would like to order the dataframe by two columns: First col6 and then col5

However, with the following conditions:

col6 only has to consider 4 decimals (meaning that 1.00005 and 1.00001 should be consider equal
col6 should be considered as absolute (meaning 1.00005 is less than -2.12132)

So the desired output would be

  col1  col2  col3 col4  col5     col6
1    A    -1     1    B     1  1.00001
0    A     2     0    a     2  1.00005
4    D     7     2    e     7  1.00003
5    C     4     3    F     4 -2.12135
3  NaN    -8     4    D     8 -2.12137
2    B     9     9    c     9 -2.12132

How can I combine the usage of keys with multiple columns?

CodePudding user response：

If you want to use arbitrary conditions on different columns, the easiest (ans most efficient) is to use numpy.lexsort:

import numpy as np

out = df.iloc[np.lexsort([df['col5'].abs(), df['col6'].round(4)])]

NB. unlike sort_values, the keys with higher priority are in the end with lexsort.

If you really want to use sort_values, you can use a custom function that choses the operation to apply depending on the Series name:

def sorter(s):
    funcs = {
        'col5': lambda s: s.abs(),
        'col6': lambda s: s.round(4)
    }
    return funcs[s.name](s) if s.name in funcs else s
    
out = df.sort_values(by=['col6', 'col5'], key=sorter)

Output:

  col1  col2  col3 col4  col5     col6
5    C     4     3    F     4 -2.12135
3  NaN    -8     4    D     8 -2.12137
2    B     9     9    c     9 -2.12132
1    A    -1     1    B     1  1.00001
4    D     7     2    e     7  1.00003
0    A     2     0    a     2  1.00005

provided example

reading again the question and the provided example, I think you might want:

df.iloc[np.lexsort([df['col5'], np.trunc(df['col6'].abs()*10**4)/10**4])]

Output:

  col1  col2  col3 col4  col5     col6
1    A    -1     1    B     1  1.00001
0    A     2     0    a     2  1.00005
4    D     7     2    e     7  1.00003
5    C     4     3    F     4 -2.12135
3  NaN    -8     4    D     8 -2.12137
2    B     9     9    c     9 -2.12132

CodePudding user response：

Proposed code :

Script ready to copy, paste and test :

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})

r = df.sort_values(by=['col6', 'col5'], key=lambda c: c.apply(lambda x: abs(float(str(x)[:-1]))) if c.name=='col6' else c)

print(r)

Result :

  col1  col2  col3 col4  col5     col6
1    A    -1     1    B     1  1.00001
0    A     2     0    a     2  1.00005
4    D     7     2    e     7  1.00003
5    C     4     3    F     4 -2.12135
3  NaN    -8     4    D     8 -2.12137
2    B     9     9    c     9 -2.12132

Other coding style inspired from Mozway

I have read the inspiring @Mozway way.

Very interesting but like s is a serie you should use the following script ready to copy, paste and test :

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})

def sorter(s):
    funcs = {
        'col5': lambda s: s,
        'col6': lambda s: s.apply(lambda x: abs(float(str(x)[:-1])))
    }
    return funcs[s.name](s) if s.name in funcs else s
    
out = df.sort_values(by=['col6', 'col5'], key=sorter)

print(out)