Why is my df.sort_values() not correctly sorting the data points?-CodePudding

I have a dataframe with returns from various investments in %. sort.values does not correctly order my returns. For example I just want to simply see the TEST column returns sorted by lowest to highest or vice versa. Please look at the test output, it is not correct.

df.sort_values('TEST')

gives me an output of returns that are NOT sorted correctly. Sort values code not in correct order

Also I am having an issue where it sorts positive numbers lowest to highest, then half way down starts again for negative numbers lowest to highest.

I just want it to look like following: -3% -1% -0.5% 1% 2% 5%

CodePudding user response：

Go for numpy.lexsort and boolean indexing :

import numpy as np

arr = np.array([float(x.rstrip("%")) for x in df["TEST"]])
idx = np.lexsort((arr,))

df = df.iloc[idx]

Output :

print(df)

    TEST
0    -3%
1    -1%
2  -0.5%
3     1%
4     2%
5     5%

Input used :

df = pd.DataFrame({"TEST": ["1%", "-3%","-0.5%", "-1%", "5%", "2%"]})

    TEST
0     1%
1    -3%
2  -0.5%
3    -1%
4     5%
5     2%

CodePudding user response：

The issue is that the lexicographic order of string is different from the natural order (1->10->2 vs 1->2->10).

One option using the key parameter of sort_values:

df.sort_values('TEST', key=lambda s: pd.to_numeric(s.str.extract(r'(-?\d \.?\d*)', expand=False)))

Or:

df.sort_values('TEST', key=lambda s: pd.to_numeric(s.str.rstrip('%')))

Output:

    TEST
1    -3%
3    -1%
2  -0.5%
0     1%
5     2%
4     5%