Extract strings based on custom list of items

Time:09-21

Say we have this df:

import pandas as pd
df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})

    a
0   hair color other family, friends
1   family, friends hair color

I want to extract strings using my own list of items:

items = ['hair color', 'other', 'family, friends']

I want to do this because there is no consistent delimiter or pattern in the raw data.

Desired output:

import numpy as np
desired_output = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color'],
                                   'hair color': ['hair color', 'hair color'],
                                   'other': ['other', np.nan],
                                   'family, friends': ['family, friends', 'family, friends']
                                  })


                                  a     hair color  other   family, friends
0   hair color other family, friends    hair color  other   family, friends
1   family, friends hair color          hair color  NaN     family, friends

CodePudding user response:

You can craft a regex to use with str.extractall:

import re

regex = '|'.join([f'({re.escape(i)})' for i in items])
# '(hair\\ color)|(other)|(family,\\ friends)'

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first())

output:

                                   a  hair color  other  family, friends
0  hair color other family, friends   hair color  other  family, friends
1         family, friends hair color  hair color   None  family, friends
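
To see why the groupby(level=0).first() step is there, it can help to look at the intermediate result. The following is a minimal sketch that rebuilds the same df and items from the question; the variable names matches, collapsed and result are only illustrative:

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['hair color other family, friends ',
                         'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

# One capture group per item, joined as alternatives.
regex = '|'.join(f'({re.escape(i)})' for i in items)

# extractall returns one row per match, indexed by (original row, match number).
# In each of those rows only the column of the group that matched is filled.
matches = df['a'].str.extractall(regex).set_axis(items, axis=1)

# Collapse back to one row per original row, keeping the first
# non-null value found in each column.
collapsed = matches.groupby(level=0).first()

result = df.join(collapsed)
print(matches)
print(result)
```

Because the items are joined as alternatives, the regex engine tries them from left to right. If one item is a prefix of another (say 'hair' and 'hair color'), listing the longer one first is probably what you want.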

update, to add a prefix to the new columns and turn None into NaN:

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first()
                   .add_prefix('item1_')
                   .replace({None: np.nan})
       )

output:

                                   a item1_hair color item1_other item1_family, friends
0  hair color other family, friends        hair color       other       family, friends
1         family, friends hair color       hair color         NaN       family, friends
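
As an alternative to the single combined regex, one str.extract call per item should give the same columns. This is a sketch of a different technique, not the answer's method, and it assumes each item only needs to be found once per row:

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['hair color other family, friends ',
                         'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

for item in items:
    # expand=False returns a Series for a single capture group;
    # rows without a match get NaN.
    df[item] = df['a'].str.extract(f'({re.escape(item)})', expand=False)

print(df)
```

This scans the column once per item, so it may be slower on a long list of items, but missing values come back as NaN without a separate replace step.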