I have the dataframe below and have created a column to categorise each row based on specific text within a string.
However, when I pass the re.IGNORECASE flag, the result is still case sensitive.
Dataframe
import re
import pandas as pd

test_data = {
    "first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
    "last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
    "title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
    "age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)
Code
category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction"
}
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].map(category_dict))
Expected output
  first_name last_name title                          text  age     category
0      Bruce       Lee    Mr        He is a Kung Fu master   32  Martial Art
1      Clark      Kent    Mr   Wears capes and tight Pants   33     Clothing
2      Bruce    Banner    Mr  Cocktails shaken not stirred   28        Drink
3      James      Bond    Mr               angry Green man   30       Colour
4      Nanny   Mc Phee   Mrs       suspect scottish accent   42     Scotland
5        Dot    Cotton   Mrs               East end legend   80    Direction
I have searched the docs and have found no pointers, so any help would be appreciated!
CodePudding user response:
Here is one way to do it.
The issue you're facing is that, while extract ignores case when matching, it returns the text as it is cased in the string (e.g. "Cocktails"), and mapping that extracted string to the dictionary is still case sensitive.
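To see the mismatch in isolation, here is a minimal sketch using two of the keys from the question. The names text, pattern, and extracted are only for illustration.

```python
import re
import pandas as pd

# Two keys from the question, both lower case in the dictionary
category_dict = {"cocktails": "Drink", "green": "Colour"}
text = pd.Series(["Cocktails shaken not stirred", "angry Green man"])

pattern = fr"\b({'|'.join(category_dict)})\b"

# The regex match ignores case, but the returned text keeps its original casing
extracted = text.str.extract(pattern, flags=re.IGNORECASE)[0]
print(extracted.tolist())

# "Cocktails" and "Green" are not keys of category_dict, so map finds nothing
print(extracted.map(category_dict).tolist())
```

The first print shows 'Cocktails' and 'Green' with capital letters, and the second shows missing values, because dictionary lookup is an exact, case-sensitive comparison.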
# create a dictionary with lower case keys
# (alternatively, you can convert the category_dict keys to lower case in place;
# I duplicated the dictionary in case you need to keep the original keys)
cd = {k.lower(): v for k, v in category_dict.items()}

# convert the extracted word to lower case, then map with the lower case dict
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].str.lower().map(cd))
df
  first_name last_name title                          text  age     category
0      Bruce       Lee    mr        He is a Kung Fu master   32  Martial Art
1      Clark      Kent    mr   Wears capes and tight Pants   33     Clothing
2      Bruce    Banner    mr  Cocktails shaken not stirred   28        Drink
3      James      Bond    mr               angry Green man   30       Colour
4      Nanny   Mc Phee   mrs       suspect scottish accent   42     Scotland
5        Dot    Cotton   mrs               East end legend   80    Direction
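If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name categorise is hypothetical, and the call to re.escape is an addition of mine (not part of the code above) that guards against keys containing regex special characters.

```python
import re
import pandas as pd

def categorise(text: pd.Series, mapping: dict) -> pd.Series:
    """Map each string to a category by case-insensitive keyword match."""
    # Lower case copy of the mapping, so the lookup ignores case too
    lookup = {key.lower(): value for key, value in mapping.items()}
    # Escape the keys so characters like "." or "+" are matched literally
    pattern = r"\b(" + "|".join(re.escape(key) for key in mapping) + r")\b"
    # expand=False returns a Series when the pattern has a single group
    matched = text.str.extract(pattern, flags=re.IGNORECASE, expand=False)
    return matched.str.lower().map(lookup)

category_dict = {"Kung Fu": "Martial Art", "green": "Colour", "East": "Direction"}
sample = pd.Series(["He is a Kung Fu master", "angry Green man", "no keyword here"])
result = categorise(sample, category_dict)
print(result.tolist())
```

Rows with no matching keyword come back as missing values, the same as in the version above. With the dataframe from the question you would call it as df['category'] = categorise(df['text'], category_dict).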