Extract float after specific substring in pandas dataframe

I am working on a dataframe where one column holds strings with a lot of unnecessary information. I only need the currency and the float that follow any of a list of specific "searchwords".

The values in that column are strings, and they can look like this:

col1   |      actual col      |   col3
----------------------------------------
       | word    foo  blabla  |   
  1    | searchword: EUR 2.67 |   4
       | random EUR: 1234.53  |
----------------------------------------
       | hello world      foo |   
  2    | searchword: GBP 1.65 |   6
       | randword: GBP 123.4  |

The only thing I need from the string in the column is the currency and the float after the searchword. The string can contain the currency in several places, so finding the float after the specified currency doesn't work, as it might return the wrong float. There are also multiple searchwords, which are stored in a list, and I need some way to extract the currency and float that follow any of them. Any tips on how to do this? I have tried regex with no luck. I also tried extracting all the floats from the string and picking one by its index, but this also risks returning the wrong float.

Any help is appreciated, thanks.

EDIT: I have tried using regex, but this seems to extract only the first number in the string, which is not necessarily the correct one:

df['actual col'] = df['actual col'].str.extract(r'(\d*\.\d+|\d+)', expand=False).astype(float)

CodePudding user response：

I would make a simple function to parse the text as you wish, then apply it by row to the dataframe.

import pandas as pd

df = pd.DataFrame({
    'col1':[1,2],
    'actual col' : ['word    foo  blabla\nsearchword: EUR 2.67\nrandom EUR: 1234.53', 'hello world      foo\nsearchword: GBP 1.65\nrandword: GBP 123.4'],
    'col3':[4,6]
})





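def parse_data(row):
    # Flatten the multi-line cell so it can be split on whitespace.
    line = row['actual col'].replace('\n',' ')
    # Keep the two tokens (currency, amount) after the last 'searchword: '.
    data = line.split('searchword: ')[-1].split()[:2]
    return ' '.join(data)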

df['new col'] = df.apply(parse_data, axis=1)
print(df)

   col1                                         actual col  col3   new col
0     1  word    foo  blabla\nsearchword: EUR 2.67\nran...     4  EUR 2.67
1     2  hello world      foo\nsearchword: GBP 1.65\nra...     6  GBP 1.65
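
The function above is tied to a single hard-coded searchword, and if that searchword is missing it quietly returns the first two tokens of the cell. Here is a minimal sketch of a variant that takes a list of searchwords and returns None when none of them is found. The `parse_text` name, the `searchwords` list and the 'otherword' entry are assumptions made up for illustration; it also assumes each searchword is followed by a colon, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2],
    'actual col': [
        'word    foo  blabla\nsearchword: EUR 2.67\nrandom EUR: 1234.53',
        'hello world      foo\notherword: GBP 1.65\nrandword: GBP 123.4',
    ],
    'col3': [4, 6],
})

# Hypothetical list of searchwords (not taken from the question).
searchwords = ['searchword', 'otherword']

def parse_text(text, words):
    """Return 'CUR amount' following the first searchword found, else None."""
    flat = text.replace('\n', ' ')
    for word in words:
        marker = word + ':'
        # Plain substring test: a short word such as 'word' would also
        # match inside 'randword:', so choose the searchwords with care.
        if marker in flat:
            tokens = flat.split(marker, 1)[1].split()[:2]
            if len(tokens) == 2:
                return ' '.join(tokens)
    return None

df['new col'] = df['actual col'].apply(lambda text: parse_text(text, searchwords))
print(df[['col1', 'new col']])
```

Rows where no searchword occurs end up with None in 'new col', which makes them easy to spot with `df['new col'].isna()`.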

CodePudding user response：

You can use a regex; you need to specify your "searchword", the separators, and so on.

df['actual col'] = df['actual col'].str.extract(r'searchword:\s*([A-Z]+\s*\d+(?:\.\d*)?)',
                                                 expand=False)

output:

   col1 actual col  col3
0     1   EUR 2.67     4
1     2   GBP 1.65     6

You can even extract the currency and amount individually:

df[['currency', 'amount']] = df['actual col'].str.extract(r'searchword:\s*([A-Z]+)\s*(\d+(?:\.\d*)?)',
                                                          expand=False)

output (the 'actual col' column is not shown):

   col1  col3 currency amount
0     1     4      EUR   2.67
1     2     6      GBP   1.65
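
Since the question mentions a list of searchwords, the same kind of pattern can be built from that list. This is a minimal sketch, assuming every searchword is followed by a colon and the currency is written in uppercase letters. The `searchwords` list and the 'otherword' entry are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2],
    'actual col': [
        'word    foo  blabla\nsearchword: EUR 2.67\nrandom EUR: 1234.53',
        'hello world      foo\notherword: GBP 1.65\nrandword: GBP 123.4',
    ],
    'col3': [4, 6],
})

# Hypothetical list of searchwords; re.escape protects any regex
# metacharacters they might contain.
searchwords = ['searchword', 'otherword']
alternation = '|'.join(re.escape(word) for word in searchwords)

# \b stops a searchword from matching in the middle of a longer word.
pattern = rf'\b(?:{alternation}):\s*([A-Z]+)\s*(\d+(?:\.\d*)?)'

# One capture group per output column: currency, then amount.
df[['currency', 'amount']] = df['actual col'].str.extract(pattern)

# str.extract returns strings, so convert the amount explicitly.
df['amount'] = df['amount'].astype(float)
print(df[['col1', 'currency', 'amount']])
```

Rows without any searchword get NaN in both new columns, and the float conversion leaves those NaN values in place.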