Extracting Float Values from a String in a DataFrame

I have the following dataframe, df:

name         result

AAA           4.5
BBB           UNK
CCC      less than 2.45
DDD     Men > 40: 2.5-3.5

The dtype of the result column is dtype('O').

I need to extract the float values and filter them with a particular threshold.

I tried the following script to extract the values and change them to float:

df['result'] = df['result'].str.extract(r'(\d+.\d+)').astype('float')

However, I'm not sure I'm getting the right output, especially when the value is a range such as 2.5-3.5. In that case, I want the max value, i.e. 3.5.

Desired output:

name         result

AAA           4.5
CCC           2.45
DDD           3.5

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response：

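str.extract only returns the first match in each string, so a range like 2.5-3.5 would give 2.5. Use extractall instead to pull every float out of each string, then use max to keep only the largest value per row.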

>>> ext = (df.result.str.extractall(r'(\d+\.\d+)')
       .astype(float)
       .unstack()
       .max(axis=1))

>>> ext
0    4.50
2    2.45
3    3.50
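
To see what extractall produces before the unstack step, here is a minimal sketch that rebuilds the sample data from the question (the DataFrame construction and the variable names matches and ext_alt are my own, for illustration). It also swaps unstack().max(axis=1) for a groupby on the outer index level, which is a different route that should give the same per-row maximum.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC", "DDD"],
    "result": ["4.5", "UNK", "less than 2.45", "Men > 40: 2.5-3.5"],
})

# extractall returns one row per match, indexed by
# (original row label, match number). Rows with no match are absent.
matches = df["result"].str.extractall(r"(\d+\.\d+)")[0].astype(float)

# Alternative to unstack().max(axis=1): take the max within each
# original row by grouping on the first index level.
ext_alt = matches.groupby(level=0).max()
print(ext_alt)
```

Note that the pattern requires a decimal point, so the integer 40 in "Men > 40" is not picked up.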

Next, join this back to your original dataframe. ext is a Series, so convert it to a DataFrame with to_frame before joining. Use an inner join to drop the rows where no float was found (NaN), or the default left join to keep them.

>>> df = df[['name']].join(ext.to_frame('result'), how='inner')
>>> df
  name  result
0  AAA    4.50
2  CCC    2.45
3  DDD    3.50
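
The question also asks about filtering by a threshold. A rough end-to-end sketch, assuming the sample data from the question and a made-up threshold of 3.0 (the names out, threshold and filtered are illustrative only):

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC", "DDD"],
    "result": ["4.5", "UNK", "less than 2.45", "Men > 40: 2.5-3.5"],
})

# Largest float found in each string; rows without a float are dropped.
ext = (df["result"].str.extractall(r"(\d+\.\d+)")
       .astype(float)
       .unstack()
       .max(axis=1))

# Inner join keeps only the rows that had at least one float.
out = df[["name"]].join(ext.to_frame("result"), how="inner")

# Hypothetical threshold; replace with whatever cutoff you need.
threshold = 3.0
filtered = out[out["result"] > threshold]
print(filtered)
```

Because result is now a float column, the usual boolean comparisons work on it directly.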