re.IGNORECASE flag not working with .str.extract

Time:10-01

I have the dataframe below and have created a column to categorise rows based on specific text within a string.

However, when I pass the re.IGNORECASE flag, the result is still case-sensitive.

Dataframe

import re
import pandas as pd

test_data = {
    "first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
    "last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
    "title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
    "age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)

Code

category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction"
}

df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].map(category_dict))

Expected output

  first_name last_name title                          text  age     category
0      Bruce       Lee    mr        He is a Kung Fu master   32  Martial Art
1      Clark      Kent    mr   Wears capes and tight Pants   33     Clothing
2      Bruce    Banner    mr  Cocktails shaken not stirred   28        Drink
3      James      Bond    mr               angry Green man   30       Colour
4      Nanny   Mc Phee   mrs       suspect scottish accent   42     Scotland
5        Dot    Cotton   mrs               East end legend   80    Direction

I have searched the docs and have found no pointers, so any help would be appreciated!

Answer:

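Here is one way to do it.

The issue is that although the extract ignores case, it returns the matched text exactly as it appears in the string (for example "Cocktails"), and the subsequent mapping to the dictionary is still case-sensitive, so a key such as "cocktails" is not found and the result is NaN.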
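To make the mismatch visible, here is a small sketch on a cut-down version of the data (the variable names `text`, `extracted` and `mapped` are only for illustration). It shows that the extract step does match regardless of case, and that the lookup afterwards is what fails:

```python
import re
import pandas as pd

# Three of the question's strings and the matching dictionary entries
text = pd.Series(["Cocktails shaken not stirred", "angry Green man", "East end legend"])
category_dict = {"cocktails": "Drink", "green": "Colour", "East": "Direction"}

pattern = fr"\b({'|'.join(category_dict.keys())})\b"

# The flag works: all three rows match, but the text keeps its original case
extracted = text.str.extract(pattern, flags=re.IGNORECASE)[0]
print(extracted.tolist())

# The dictionary lookup is an exact, case-sensitive match, so only "East" is found
mapped = extracted.map(category_dict)
print(mapped.tolist())
```

Only the last row gets a category, because "East" is the only extracted value whose case happens to agree with its dictionary key.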

# Build a copy of the dictionary with lower-case keys.
# A copy is used in case the original keys are still needed;
# alternatively, lower-case the keys of category_dict itself.
cd = {k.lower(): v for k, v in category_dict.items()}

# Lower-case the extracted word, then map it with the lower-case dictionary
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].str.lower().map(cd))
df
  first_name last_name title                          text  age     category
0      Bruce       Lee    mr        He is a Kung Fu master   32  Martial Art
1      Clark      Kent    mr   Wears capes and tight Pants   33     Clothing
2      Bruce    Banner    mr  Cocktails shaken not stirred   28        Drink
3      James      Bond    mr               angry Green man   30       Colour
4      Nanny   Mc Phee   mrs       suspect scottish accent   42     Scotland
5        Dot    Cotton   mrs               East end legend   80    Direction
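
For completeness, a self-contained sketch of the same approach that can be run on its own. Two things here are my additions rather than part of the answer above: the keys are passed through `re.escape` in case a key ever contains regex metacharacters, and the names `lower_map` and `pattern` are just illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants",
             "Cocktails shaken not stirred", "angry Green man",
             "suspect scottish accent", "East end legend"],
})

category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction",
}

# Lower-case copy of the dictionary, used only for the lookup
lower_map = {k.lower(): v for k, v in category_dict.items()}

# Assumption: escape each key so it is matched literally inside the regex
pattern = r"\b(" + "|".join(re.escape(k) for k in category_dict) + r")\b"

df["category"] = (
    df["text"]
    .str.extract(pattern, flags=re.IGNORECASE)[0]  # case-insensitive match
    .str.lower()                                   # normalise the matched text
    .map(lower_map)                                # case now agrees with the keys
)
print(df)
```

Rows whose text contains none of the keys would still end up as NaN in `category`.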