I have the following
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, -1, 9, -8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
'col5': [2, 1, 9, 8, 7, 4],
'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})
print(df)
print(df.sort_values(by=['col5']))
print(df.sort_values(by=['col2']))
print(df.sort_values(by='col2', key=lambda col: col.abs() ))
So far so good.
However I would like to order the dataframe by two columns: First col6 and then col5
However, with the following conditions:
- col6 only has to consider 4 decimals (meaning that
1.00005
and1.00001
should be consider equal - col6 should be considered as absolute (meaning
1.00005
is less than-2.12132
)
So the desired output would be
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
How can I combine the usage of keys with multiple columns?
CodePudding user response:
If you want to use arbitrary conditions on different columns, the easiest (ans most efficient) is to use numpy.lexsort
:
import numpy as np
out = df.iloc[np.lexsort([df['col5'].abs(), df['col6'].round(4)])]
NB. unlike sort_values
, the keys with higher priority are in the end with lexsort
.
If you really want to use sort_values
, you can use a custom function that choses the operation to apply depending on the Series name:
def sorter(s):
funcs = {
'col5': lambda s: s.abs(),
'col6': lambda s: s.round(4)
}
return funcs[s.name](s) if s.name in funcs else s
out = df.sort_values(by=['col6', 'col5'], key=sorter)
Output:
col1 col2 col3 col4 col5 col6
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
1 A -1 1 B 1 1.00001
4 D 7 2 e 7 1.00003
0 A 2 0 a 2 1.00005
provided example
reading again the question and the provided example, I think you might want:
df.iloc[np.lexsort([df['col5'], np.trunc(df['col6'].abs()*10**4)/10**4])]
Output:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
CodePudding user response:
Proposed code :
Script ready to copy, paste and test
:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, -1, 9, -8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
'col5': [2, 1, 9, 8, 7, 4],
'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})
r = df.sort_values(by=['col6', 'col5'], key=lambda c: c.apply(lambda x: abs(float(str(x)[:-1]))) if c.name=='col6' else c)
print(r)
Result :
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
Other coding style inspired from Mozway
I have read the inspiring @Mozway way.
Very interesting but like s is a serie you should use the following script ready to copy, paste and test
:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, -1, 9, -8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
'col5': [2, 1, 9, 8, 7, 4],
'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})
def sorter(s):
funcs = {
'col5': lambda s: s,
'col6': lambda s: s.apply(lambda x: abs(float(str(x)[:-1])))
}
return funcs[s.name](s) if s.name in funcs else s
out = df.sort_values(by=['col6', 'col5'], key=sorter)
print(out)