Extracting Floating Values from A String In A Dataframe

Time:12-14

I have the following dataframe, df:

name         result

AAA           4.5
BBB           UNK
CCC      less than 2.45
DDD     Men > 40: 2.5-3.5

The dtype of the result column is dtype('O').

I need to extract the float values and filter them with a particular threshold.

I tried the following script to extract the values and change them to float:

df['result'] = df['result'].str.extract(r'(\d+.\d+)').astype('float')

However, I'm not sure I'm getting the right output, especially when the value is a range such as 2.5-3.5. In that case, I want the max value, i.e. 3.5.

Desired output:

name         result

AAA           4.5
CCC           2.45
DDD           3.5

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response:

You can use extractall to extract every float that occurs in each string, then use max to keep only the largest value per row.

>>> ext = (df.result.str.extractall(r'(\d+\.\d+)')
...        .astype(float)
...        .unstack()
...        .max(axis=1))

>>> ext
0    4.50
2    2.45
3    3.50
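
To make the intermediate step easier to follow, here is a self-contained sketch that rebuilds the sample frame from the question. It assumes the intended pattern is `\d+\.\d+` (digits, a literal dot, digits), which skips bare integers such as the 40 in "Men > 40". It also swaps unstack/max(axis=1) for groupby(level=0).max(), which should give the same result here; the variable name matches is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC", "DDD"],
    "result": ["4.5", "UNK", "less than 2.45", "Men > 40: 2.5-3.5"],
})

# extractall returns one row per match, indexed by (original row, match number).
# Column 0 holds the single capture group.
matches = df["result"].str.extractall(r"(\d+\.\d+)")[0].astype(float)

# Keep the largest match for each original row.
# Rows with no match (BBB) never appear in `matches`, so they are absent here too.
ext = matches.groupby(level=0).max()
print(ext)
```

Row 3 contributes two matches (2.5 and 3.5), and taking the max per original row is what selects the upper end of the range.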

Next, join this back to your original dataframe. ext is a Series, so convert it to a DataFrame with to_frame before joining. To drop the rows where no value was extracted, use an inner join; to keep them as NaN, use a left join (the default).

>>> df = df[['name']].join(ext.to_frame('result'), how='inner')
>>> df
  name  result
0  AAA    4.50
2  CCC    2.45
3  DDD    3.50
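
The question also asks about filtering by a threshold, which can be done with a boolean mask once the column is numeric. This is a minimal sketch, not the only way to do it: the THRESHOLD value of 3.0 and the names out and filtered are made up for illustration, and the frame is rebuilt from the question's sample data so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC", "DDD"],
    "result": ["4.5", "UNK", "less than 2.45", "Men > 40: 2.5-3.5"],
})

# Largest float found in each string; rows without a float are dropped.
ext = (df["result"].str.extractall(r"(\d+\.\d+)")[0]
       .astype(float)
       .groupby(level=0)
       .max())

# Inner join keeps only the rows that produced a value.
out = df[["name"]].join(ext.rename("result"), how="inner")

THRESHOLD = 3.0  # hypothetical cutoff, not from the question
filtered = out[out["result"] >= THRESHOLD]
print(filtered)
```

With the sample data and this cutoff, CCC (2.45) falls below the threshold and is removed, leaving AAA and DDD.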