What I would like to achieve
I have a DataFrame whose indices are "ID (int) underscore (_) name (str)". I would like to sort the data based on the ID.
import pandas as pd
data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14]]
index = ['11_ItemName', '0_ItemName', '1_ItemName', '2_ItemName', '10_ItemName', '20_ItemName', '101_ItemName']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
# column1 column2
# 11_ItemName 1 2
# 0_ItemName 3 4
# 1_ItemName 5 6
# 2_ItemName 7 8
# 10_ItemName 9 10
# 20_ItemName 11 12
# 101_ItemName 13 14
# print(DO SOMETHING!)
## Expected output
# column1 column2
# 0_ItemName 3 4
# 1_ItemName 5 6
# 2_ItemName 7 8
# 10_ItemName 9 10
# 11_ItemName 1 2
# 20_ItemName 11 12
# 101_ItemName 13 14
What I tested
I tried to use sort_index
and failed.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html
To be honest, I'm stuck at the debug of lambda
.
df(sort_index())
## Output
# column1 column2
# 0_ItemName 3 4
# 101_ItemName 13 14
# 10_ItemName 9 10
# 11_ItemName 1 2
# 1_ItemName 5 6
# 20_ItemName 11 12
# 2_ItemName 7 8
print(df.sort_index(key=(lambda x: int(x.str.split('_')[0]))))
# TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'
Environment
Python 3.10.5
Pandas 1.4.3
CodePudding user response:
Try df.sort_index
with custom key=
:
df = df.sort_index(
key=lambda idx: idx.str.split("_")
.str[0]
.astype(int)
)
print(df)
Prints:
column1 column2
0_ItemName 3 4
1_ItemName 5 6
2_ItemName 7 8
10_ItemName 9 10
11_ItemName 1 2
20_ItemName 11 12
101_ItemName 13 14
CodePudding user response:
Use natsort
for natural sorting:
# pip install natsort
from natsort import natsort_key
out = df.sort_index(key=natsort_key)
output:
column1 column2
0_ItemName 3 4
1_ItemName 5 6
2_ItemName 7 8
10_ItemName 9 10
11_ItemName 1 2
20_ItemName 11 12
101_ItemName 13 14
CodePudding user response:
As per the pandas docs, the key function needs to be vectorised:
keycallable, optional If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.
So your key function needs to accept the whole index and return an index rather than scalar values.
As I write this Andrej has just supplied an example of how this is done in his answer.
Depending on your coding style you might prefer to avoid the lambda and use a named function
def sort_by_before_underscore(ix):
return (ix.str.split("_")
.str[0]
.astype(int))
df = df.sort_index(
key=sort_by_before_underscore
)