Home > database >  Loop over regular expressions using Pandas str.extract
Loop over regular expressions using Pandas str.extract

Time:07-11

I want to extract numeric values from arbitrary strings in a column in my pandas dataframe.

Two regexes that shall be looped over the column "watt" using str.extract.

The str.extract function shall be applied to all NaN values.

On the next iteration, non NaN values (=matches) shall be excluded from the str.extract operation so that previous results are retained and not overwritten.

I must be totally misunderstanding something here, because my implementation is not working.

Although I am using .isnan() to filter out previous matches, it overwrites previous matches.

import pandas as pd

df = pd.DataFrame([{'title':'This bulb operates at 222 watts and is fabulous.'},
                   {'title':'This bulb operates at 999 w and is fantastic.'}])

regexes = ['([0-9\.,]{1,})[\s\-]{0,1}watt[s]{0,1} ', '([0-9\.,]{1,})[\s\-]{0,1}w ']

for regex in regexes:
    
    #create column with nan values on first iteration
    if 'watt' not in df.columns:
        df['watt'] = np.nan
                
    #select only rows from "watt" column with nan values -> run str.extract() -> update df "watt" column, repeat... 
    print(df[df['watt'].isnull()]['title'].str.extract(regex)) #debug
    df['watt'] = df[df['watt'].isnull()]['title'].str.extract(regex)

print(df)

CodePudding user response:

I think you have 2 options within your framework: In both you should mask the NaNs, in the column you're searching in as well as the column you're writing to.

Since .str.extract() returns a series with expand=False (default), the writing needs a bit of tuning (using .values):

regexes = [r'([0-9\.,] )[\s\-]?watt[s]? ', r'([0-9\.,] )[\s\-]?w ']
df['watt'] = np.nan
for regex in regexes:                    
    mask = df['watt'].isna()
    df.loc[mask, 'watt'] = df.loc[mask, 'title'].str.extract(regex).values

Or you could use named groups in the regex such that the group name matches the label of the column you're writing to:

regexes = [r'(?P<watt>[0-9\.,] )[\s\-]?watt[s]? ', r'(?P<watt>[0-9\.,] )[\s\-]?w ']
df['watt'] = np.nan
for regex in regexes:                    
    mask = df['watt'].isna()
    df.loc[mask, 'watt'] = df.loc[mask, 'title'].str.extract(regex)

Both produce the following result:

                                              title watt
0  This bulb operates at 222 watts and is fabulous.  222
1     This bulb operates at 999 w and is fantastic.  999
  • Related