I have this dataframe (all strings):
to_sort data
0 Belgien a2
1 Zürich b2
2 dänemark c2
3 20 d2
4 100 e2
5 Österreich f2
I want to sort it so that German umlauts are correct, also lowercase and numbers are correct:
to_sort data
3 20 d2
4 100 e2
0 Belgien a2
2 dänemark c2
5 Österreich f2
1 Zürich b2
Here is my code to generate the dataframe and result:
import io, pandas as pd
t = io.StringIO("""
to_sort|data
Belgien|a2
Zürich|b2
dänemark|c2
20|d2
100|e2
Österreich|f2""")
df = pd.read_csv(t, sep='|')
df = df.sort_values(by='to_sort', key=lambda col: col.str.lower().str.normalize('NFD'))
The result is almost correct, but the numbers are sorted in the wrong order, 20 should be before 200:
to_sort data
4 100 e2
3 20 d2
0 Belgien a2
2 dänemark c2
5 Österreich f2
1 Zürich b2
How can I fix the number sorting, while maintaining all the other characteristics?
CodePudding user response:
Just append .str.pad(max(col.str.len())
:
key = lambda col: col.str.lower().str.normalize('NFD').str.pad(max(col.str.len()))
df = df.sort_values(by='to_sort', key=key)
print(df)
# Output
to_sort data
3 20 d2
4 100 e2
0 Belgien a2
1 Zürich b2
2 dänemark c2
5 Österreich f2
CodePudding user response:
Use solution from last sample data in DataFrame.sort_values
:
from natsort import index_natsorted
f = lambda col: np.argsort(index_natsorted(col.str.lower().str.normalize('NFD')))
df = df.sort_values(by='to_sort', key=f )
print (df)
to_sort data
3 20 d2
4 100 e2
0 Belgien a2
2 dänemark c2
5 Österreich f2
1 Zürich b2