Home > other >  Sorting a Pandas Frame which using is natural sort by multiple column labels
Sorting a Pandas Frame which using is natural sort by multiple column labels

Time:10-14

I want to sort a pandas frame by multiple columns. The constraint I have which is giving me trouble ist, that one of the columns (the first) needs to be natural sorted, so I tried the following:

sortedFrame = inFrame.sort_values(by=['Col_Arg', 'Col_Step'],
                                      key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg'])))

but this code results in the frame only being sorted by Col_Arg. E.g. the input frame

Col_Arg Col_Step
1 First 20
2 Second 10
1 First 10

results in

Col_Arg Col_Step
1 First 20
1 First 10
2 Second 10

You can imagine Col_Arg as an indexed headline. Inside that indexed headline are steps to execute (Col_2). Since Col_Arg is an string which cannot be transformed in an integer, I want to use natsort, which is working fine to sort for Col_Arg alone, but is not working with multiple column names. The easy way is just to introduce an additional index for the headlines. Then I could just easily use:

sortedFrame = inFrame.sort_values(['Col_Arg_Idx', 'Col_2'])

Since I am quite new to python and pandas I am curious and I want to understand what is my misconception and how you would do it, since I think I should be possible. I can imagine that it has to do with the usage of key:

key: keycallable, optional Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.

But that does not mean it is applied on all... I am a little confused.

In order to introduce intermediate steps later, the indexes are initially incremented by 10 starting from 10.

Thanks in advance.

CodePudding user response:

So the problem with your code is that when you use multiple columns inside pandas.DataFrame.sort_values, alongside the key parameter, pandas takes each series in the order you've defined inside the parameter by=["col1", "col2"] and calls that function passing the column values to it in the order they appeared before calling sort_values.

For example, let's define a simple function that only prints the arguments it receives, and use it as our key parameter:

import pandas as pd
import numpy as np
from natsort import index_natsorted, natsort_keygen, natsorted


# == Sample DataFrame ============================
df = pd.DataFrame(
    {
        'Col_Arg': ['First', 'Second', 'First', 'Third', 'Fourth', 'Tenth'],
        'Col_Step': [20, 10, 10, 30, 20, 5]
    }
)


# == Simple Key Function ============================

def print_values(x):
    print(x)
    return x


df.sort_values(
   by=["Col_Arg", "Col_Step"],
   key=lambda x: print_values(x)
)
# Returns:
"""
0     1 First
1    2 Second
2     1 First
3     3 Third
4    4 Fourth
5    10 Tenth
Name: Col_Arg, dtype: object
0    20
1    10
2    10
3    30
4    20
5     5
Name: Col_Step, dtype: int64
"""

So basically pandas.DataFrame.sort_values passes each column as a series, to your function, and it expects your function to do some transformation to make the column "sortable". Here's the parameter description from pandas documentation:

key: callable, optional

Description:

Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.

In other words, if you want to sort both columns in the same pandas.DataFrame.sort_values operation, you need to pass in a function that's able to convert 'Col_Arg' to a numeric form, while returning 'Col_Step' unmodified. Additionally, by using inFrame in key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg'])) instead of passing x, the key function will sort values based on the inFrame indexes in the order they existed prior to calling the sort_values function. Here's an example:


df.sort_values(
   by=["Col_Arg", "Col_Step"],
   key=lambda x: print_values(np.argsort(index_natsorted(df["Col_Step"])))
)
# Prints:
"""
[3 1 2 5 4 0]
[3 1 2 5 4 0]
"""

So, the first time the key function gets called it sorts the dataframe indexes using [3 1 2 5 4 0], then it applies the same order as it did before, but now all indexes have already been moved, so it ends up ruining the sort operation.

Quick Fix

As stated previously, the key function takes each column value in the order they exist prior to the sort operation. So we need to create a function that converts 'Col_Arg' values to numbers, instead of trying to sort inside the key function. There's a package called number-parser that does that for you. To install it, run the following code:

pip install number-parser

Then, you can create a function to use inside key like so:


import numpy as np
import pandas as pd
from number_parser import parse_ordinal


def custom_sort(col: pd.Series) -> pd.Series:
    if col.name == "Col_Arg":
        return col.apply(parse_ordinal)
    return col

df.sort_values(
   by=["Col_Arg", "Col_Step"],
   key=custom_sort
)
# Returns:
"""
  Col_Arg  Col_Step
2   First        10
0   First        20
1  Second        10
3   Third        30
4  Fourth        20
5   Tenth         5
"""

Solution 2: another option would be to do something like this:

import pandas as pd
import numpy as np
from natsort import index_natsorted, natsort_keygen, natsorted


df.sort_values(
    by="Col_Arg", key=lambda col: np.argsort(index_natsorted(col))
).groupby("Col_Arg", as_index=False).apply(
    lambda grp: grp.sort_values("Col_Step")
).reset_index(
    drop=True
)
# Returns:
"""
  Col_Arg  Col_Step
0   First        10
1   First        20
2  Fourth        20
3  Second        10
4   Tenth         5
5   Third        30
"""
  • Related