Home > database >  Sort DataFrame based on part of its index
Sort DataFrame based on part of its index

Time:08-24

What I would like to achieve

I have a DataFrame whose indices are "ID (int) underscore (_) name (str)". I would like to sort the data based on the ID.

import pandas as pd

data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14]]
index = ['11_ItemName', '0_ItemName', '1_ItemName', '2_ItemName', '10_ItemName', '20_ItemName', '101_ItemName']
columns = ['column1', 'column2']

df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
#              column1  column2
# 11_ItemName        1        2
# 0_ItemName         3        4
# 1_ItemName         5        6
# 2_ItemName         7        8
# 10_ItemName        9       10
# 20_ItemName       11       12
# 101_ItemName      13       14

# print(DO SOMETHING!)
## Expected output
#              column1  column2
# 0_ItemName         3        4
# 1_ItemName         5        6
# 2_ItemName         7        8
# 10_ItemName        9       10
# 11_ItemName        1        2
# 20_ItemName       11       12
# 101_ItemName      13       14

What I tested

I tried to use sort_index and failed.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html

To be honest, I'm stuck at the debug of lambda.

df(sort_index())
## Output
#              column1  column2
# 0_ItemName         3        4
# 101_ItemName      13       14
# 10_ItemName        9       10
# 11_ItemName        1        2
# 1_ItemName         5        6
# 20_ItemName       11       12
# 2_ItemName         7        8

print(df.sort_index(key=(lambda x: int(x.str.split('_')[0]))))
# TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

Environment

Python 3.10.5
Pandas 1.4.3

CodePudding user response:

Try df.sort_index with custom key=:

df = df.sort_index(
    key=lambda idx: idx.str.split("_")
    .str[0]
    .astype(int)
)
print(df)

Prints:

              column1  column2
0_ItemName          3        4
1_ItemName          5        6
2_ItemName          7        8
10_ItemName         9       10
11_ItemName         1        2
20_ItemName        11       12
101_ItemName       13       14

CodePudding user response:

Use natsort for natural sorting:

# pip install natsort
from natsort import natsort_key

out = df.sort_index(key=natsort_key)

output:

              column1  column2
0_ItemName          3        4
1_ItemName          5        6
2_ItemName          7        8
10_ItemName         9       10
11_ItemName         1        2
20_ItemName        11       12
101_ItemName       13       14

CodePudding user response:

As per the pandas docs, the key function needs to be vectorised:

keycallable, optional If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

So your key function needs to accept the whole index and return an index rather than scalar values.

As I write this Andrej has just supplied an example of how this is done in his answer.

Depending on your coding style you might prefer to avoid the lambda and use a named function

def sort_by_before_underscore(ix):
    return (ix.str.split("_")
    .str[0]
    .astype(int))

df = df.sort_index(
    key=sort_by_before_underscore
)
  • Related