I have a followup question to the question I asked here.
My pandas dataframe consists of several columns, the first column has string values, the others have floats. It looks something like this:
Superheros hourly rate
Spiderman (Nr 1) 12.9
Batman (Nr 4) 93.7
Joker 23.6
Iron Man (Nr 2) 49.3
Hulk 32.5
Captain America 24.9
Wonderwoman (Nr 3) 65.8
In general the dataframe should be sorted based on the column of hourly rate, but all superheros with a number should be located at the bottom of the dataframe. The resulting dataframe should look like this
Superheros hourly rate
Hulk 32.5
Captain America 24.9
Joker 23.6
Batman (Nr 4) 93.7
Wonderwoman (Nr 3) 65.8
Iron Man (Nr 2) 49.3
Spiderman (Nr 1) 12.9
In my question from before I didn't have the condition of sorting by the second column, and the suggested solution led to this code snipplet:
df['Superheros'] = df['Superheros'].str.replace(r'\bim\s (Nr\s*\d )', r'(\1)', regex=True)
df = df.replace(np.nan, '-')
s = pd.to_numeric(df['Superheros'].str.extract('\(VG (\d )\)', expand=False))
idx = s.sort_values(na_position='first').index
df = df.loc[idx]
based on the same logic it was suggested to use:
sorter = lambda s: pd.to_numeric(s.str.extract('\(Nr (\d )\)', expand=False))
out = df.sort_values(by='Superheros', key=sorter, na_position='first')
So I thought I could simply change by='Superheros' to by='hourly rate', but this does not work (I think I understand why). Unfortunately I don't have any other idea on how else to do this, besides maybe splitting the dataframe in two parts, sorting them seperatly and then merging them back together. Is there a nicer pythonic way to do this?
Thanks in advance!
CodePudding user response:
Use a temporary column to sort the Superheros with numbers:
(df
.assign(bottom=df['Superheros'].str.contains('\d\)'))
.sort_values(by=['bottom', 'hourly rate'], ascending=[True, False])
.drop(columns='bottom')
)
output:
Superheros hourly rate
4 Hulk 32.5
5 Captain America 24.9
2 Joker 23.6
1 Batman (Nr 4) 93.7
6 Wonderwoman (Nr 3) 65.8
3 Iron Man (Nr 2) 49.3
0 Spiderman (Nr 1) 12.9
intermediate (without dropping the 'bottom' column):
True
(=1) is sorted after False
(=0).
Superheros hourly rate bottom
4 Hulk 32.5 False
5 Captain America 24.9 False
2 Joker 23.6 False
1 Batman (Nr 4) 93.7 True
6 Wonderwoman (Nr 3) 65.8 True
3 Iron Man (Nr 2) 49.3 True
0 Spiderman (Nr 1) 12.9 True
CodePudding user response:
As an alternative solution, you can use sort_values
twice, the second one with a key
parameter:
res = (df.sort_values(by="hourly rate", ascending=False)
.sort_values(by=["Superheros"], key=lambda x: x.str.contains(pat=r"\d", regex=True)))
print(res)
Output
Superheros hourly rate
4 Hulk 32.5
5 Captain America 24.9
2 Joker 23.6
1 Batman (Nr 4) 93.7
6 Wonderwoman (Nr 3) 65.8
3 Iron Man (Nr 2) 49.3
0 Spiderman (Nr 1) 12.9
Note on performance
Sometimes sorting twice can be faster, see this. Additionally for an out-sample of your original df (see sample
):
df = df.sample(1000, replace=True) # len(df) is 1000
I get the following timings:
%timeit df.sort_values(by="hourly rate", ascending=False).sort_values(by=["Superheros"], key=lambda x: x.str.contains(pat=r"\d", regex=True))
781 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.assign(bottom=df['Superheros'].str.contains('\d\)')).sort_values(by=['bottom', 'hourly rate'], ascending=[True, False]).drop(columns='bottom')
1.36 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)