Home > database >  Sort only parts of pandas Dataframe based on column values
Sort only parts of pandas Dataframe based on column values

Time:07-19

I have a followup question to the question I asked here.

My pandas dataframe consists of several columns, the first column has string values, the others have floats. It looks something like this:

Superheros        hourly rate
Spiderman (Nr 1)   12.9
Batman (Nr 4)      93.7
Joker              23.6
Iron Man (Nr 2)    49.3
Hulk               32.5
Captain America    24.9
Wonderwoman (Nr 3) 65.8

In general the dataframe should be sorted based on the column of hourly rate, but all superheros with a number should be located at the bottom of the dataframe. The resulting dataframe should look like this

Superheros        hourly rate
Hulk               32.5
Captain America    24.9
Joker              23.6
Batman (Nr 4)      93.7
Wonderwoman (Nr 3) 65.8
Iron Man (Nr 2)    49.3
Spiderman (Nr 1)   12.9

In my question from before I didn't have the condition of sorting by the second column, and the suggested solution led to this code snipplet:

df['Superheros'] = df['Superheros'].str.replace(r'\bim\s (Nr\s*\d )', r'(\1)', regex=True)
df = df.replace(np.nan, '-')
s = pd.to_numeric(df['Superheros'].str.extract('\(VG (\d )\)', expand=False))
idx = s.sort_values(na_position='first').index
df = df.loc[idx]

based on the same logic it was suggested to use:

sorter = lambda s: pd.to_numeric(s.str.extract('\(Nr (\d )\)', expand=False))

out = df.sort_values(by='Superheros', key=sorter, na_position='first')

So I thought I could simply change by='Superheros' to by='hourly rate', but this does not work (I think I understand why). Unfortunately I don't have any other idea on how else to do this, besides maybe splitting the dataframe in two parts, sorting them seperatly and then merging them back together. Is there a nicer pythonic way to do this?

Thanks in advance!

CodePudding user response:

Use a temporary column to sort the Superheros with numbers:

(df
 .assign(bottom=df['Superheros'].str.contains('\d\)'))
 .sort_values(by=['bottom', 'hourly rate'], ascending=[True, False])
 .drop(columns='bottom')
)

output:

           Superheros  hourly rate
4                Hulk         32.5
5     Captain America         24.9
2               Joker         23.6
1       Batman (Nr 4)         93.7
6  Wonderwoman (Nr 3)         65.8
3     Iron Man (Nr 2)         49.3
0    Spiderman (Nr 1)         12.9

intermediate (without dropping the 'bottom' column):

True (=1) is sorted after False (=0).

           Superheros  hourly rate  bottom
4                Hulk         32.5   False
5     Captain America         24.9   False
2               Joker         23.6   False
1       Batman (Nr 4)         93.7    True
6  Wonderwoman (Nr 3)         65.8    True
3     Iron Man (Nr 2)         49.3    True
0    Spiderman (Nr 1)         12.9    True

CodePudding user response:

As an alternative solution, you can use sort_values twice, the second one with a key parameter:

res = (df.sort_values(by="hourly rate", ascending=False)
       .sort_values(by=["Superheros"], key=lambda x: x.str.contains(pat=r"\d", regex=True)))
print(res)

Output

           Superheros  hourly rate
4                Hulk         32.5
5     Captain America         24.9
2               Joker         23.6
1       Batman (Nr 4)         93.7
6  Wonderwoman (Nr 3)         65.8
3     Iron Man (Nr 2)         49.3
0    Spiderman (Nr 1)         12.9

Note on performance

Sometimes sorting twice can be faster, see this. Additionally for an out-sample of your original df (see sample):

df = df.sample(1000, replace=True)  # len(df) is 1000

I get the following timings:

%timeit df.sort_values(by="hourly rate", ascending=False).sort_values(by=["Superheros"], key=lambda x: x.str.contains(pat=r"\d", regex=True))
781 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.assign(bottom=df['Superheros'].str.contains('\d\)')).sort_values(by=['bottom', 'hourly rate'], ascending=[True, False]).drop(columns='bottom')
1.36 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • Related