Home > Net >  Pandas: Sort dataframe correctly with German umlauts, upper/lowercase and numbers
Pandas: Sort dataframe correctly with German umlauts, upper/lowercase and numbers

Time:12-10

I have this dataframe (all strings):

      to_sort data
0     Belgien   a2
1      Zürich   b2
2    dänemark   c2
3          20   d2
4         100   e2
5  Österreich   f2

I want to sort it so that German umlauts are correct, also lowercase and numbers are correct:

      to_sort data
3          20   d2
4         100   e2
0     Belgien   a2
2    dänemark   c2
5  Österreich   f2
1      Zürich   b2

Here is my code to generate the dataframe and result:

import io, pandas as pd

t = io.StringIO("""
to_sort|data
Belgien|a2
Zürich|b2
dänemark|c2
20|d2
100|e2
Österreich|f2""")
df = pd.read_csv(t, sep='|')

df = df.sort_values(by='to_sort', key=lambda col: col.str.lower().str.normalize('NFD'))

The result is almost correct, but the numbers are sorted in the wrong order, 20 should be before 200:

      to_sort data
4         100   e2
3          20   d2
0     Belgien   a2
2    dänemark   c2
5  Österreich   f2
1      Zürich   b2

How can I fix the number sorting, while maintaining all the other characteristics?

CodePudding user response:

Just append .str.pad(max(col.str.len()):

key = lambda col: col.str.lower().str.normalize('NFD').str.pad(max(col.str.len()))
df = df.sort_values(by='to_sort', key=key)
print(df)

# Output
      to_sort data
3          20   d2
4         100   e2
0     Belgien   a2
1      Zürich   b2
2    dänemark   c2
5  Österreich   f2

CodePudding user response:

Use solution from last sample data in DataFrame.sort_values:

from natsort import index_natsorted

f = lambda col: np.argsort(index_natsorted(col.str.lower().str.normalize('NFD')))
df = df.sort_values(by='to_sort', key=f )
print (df)
      to_sort data
3          20   d2
4         100   e2
0     Belgien   a2
2    dänemark   c2
5  Österreich   f2
1      Zürich   b2
  • Related