I have the following dataframe, df:
name  result
AAA   4.5
BBB   UNK
CCC   less than 2.45
DDD   Men > 40: 2.5-3.5
The dtype of the result column is dtype('O').
I need to extract the float values and filter them with a particular threshold.
I tried the following script to extract the values and change them to float:
df['result'] = df['result'].str.extract(r'(\d+\.\d+)').astype('float')
However, I'm not sure I'm getting the right output, especially when the value contains a range such as 2.5-3.5. In that case, I want the max value, i.e. 3.5.
Desired output:
name  result
AAA   4.5
CCC   2.45
DDD   3.5
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
You can use extractall to extract every float occurrence in each string, then use max to keep only the largest value per row.
>>> ext = (df.result.str.extractall(r'(\d+\.\d+)')
...        .astype(float)
...        .unstack()
...        .max(axis=1))
>>> ext
0    4.50
2    2.45
3    3.50
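For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample dataframe from the question and, as an assumed alternative to unstack().max(axis=1), collapses the matches with groupby(level=0).max(); both approaches should yield one maximum per original row. The variable name matches is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC", "DDD"],
    "result": ["4.5", "UNK", "less than 2.45", "Men > 40: 2.5-3.5"],
})

# extractall returns one row per match, indexed by (original row, match number).
# Column 0 holds the single capture group.
matches = df["result"].str.extractall(r"(\d+\.\d+)")[0].astype(float)

# Group on the original row label and keep the largest match in each group.
# Rows without any match (BBB here) simply do not appear in the result.
ext = matches.groupby(level=0).max()
print(ext)
```

Note that the escaped dot (\.) matters: an unescaped dot matches any character, so a string like "12x5" would also be picked up.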
Next, you can join this back to your original dataframe. ext is currently a Series, so convert it to a dataframe with to_frame before joining. To drop the rows with no extracted value (NaN), use an inner join; to keep them, use a left join (the default).
>>> df = df[['name']].join(ext.to_frame('result'), how='inner')
>>> df
  name  result
0  AAA    4.50
2  CCC    2.45
3  DDD    3.50
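The question also asks about filtering by a threshold, which the steps above do not cover. The sketch below shows one way it might be done, comparing the left and inner joins first. The threshold value of 3.0 is a made-up example, and the names kept, dropped and above are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC", "DDD"],
    "result": ["4.5", "UNK", "less than 2.45", "Men > 40: 2.5-3.5"],
})

# Largest float per row, as a Series named "result" so it can be joined directly.
ext = (df["result"].str.extractall(r"(\d+\.\d+)")[0]
       .astype(float)
       .groupby(level=0)
       .max()
       .rename("result"))

# Left join (the default): BBB is kept, with NaN as its result.
kept = df[["name"]].join(ext)

# Inner join: BBB is dropped because it has no extracted value.
dropped = df[["name"]].join(ext, how="inner")

# Hypothetical threshold, chosen only for illustration.
threshold = 3.0
above = dropped[dropped["result"] > threshold]
print(above)
```

Comparisons against NaN evaluate to False, so the same boolean filter applied to kept would also exclude BBB.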