I have a dataframe (df) with a string column called A:
index  A
----------------------------
0      boy_was_born_in_2010
1      men_was_born_in_1997
2      girl_this_is_2022
3      this_is_a_lady
4      how_tall_is_this_boy
5      girl_is_studying
I want to write code that looks for specific keywords (boy, girl, men, lady) in column A and, when one is found, fills a new column B accordingly. If the string in A contains boy, B should be Kid; likewise men -> Male, girl -> Kid, and lady -> Female. The expected result is:
index  A                      B
------------------------------------------
0      boy_was_born_in_2010   Kid
1      men_was_born_in_1997   Male
2      girl_this_is_2022      Kid
3      this_is_a_lady         Female
4      how_tall_is_this_boy   Kid
5      girl_is_studying       Kid
I have used the following code:
df['B']=df.A.str.findall('boy').transform(''.join).replace('boy','Kid')
This works for a single keyword, but when I apply it again for another keyword, the values set by the previous call are overwritten. I also don't think this is an optimized way to do it.
I also tried this:
df['B'] = df['A'].replace(to_replace ='^boy\W', value = 'Kid', regex = True) # not working
If you have any other way of doing this, please suggest it. I am a complete beginner and have just started working on NLP projects.
CodePudding user response:
Adding an example with multiple matches:
import pandas as pd
data = {'A': ['boy_was_born_in_2010',
'men_was_born_in_1997',
'girl_this_is_2022',
'this_is_a_lady',
'how_tall_is_this_boy',
'girl_is_studying',
'boy meets men']}
df = pd.DataFrame(data)
print(df)
A
0 boy_was_born_in_2010
1 men_was_born_in_1997
2 girl_this_is_2022
3 this_is_a_lady
4 how_tall_is_this_boy
5 girl_is_studying
6 boy meets men
Set up a dict for the mapping, use str.extract with a pattern of alternatives, squeeze the result, and finally apply map:
a_dict = {'boy':'Kid',
'men':'Male',
'girl':'Kid',
'lady':'Female'}
pattern = '(' + '|'.join(list(a_dict)) + ')'
df['B'] = df.A.str.extract(pattern).squeeze().map(a_dict)
print(df)
A B
0 boy_was_born_in_2010 Kid
1 men_was_born_in_1997 Male
2 girl_this_is_2022 Kid
3 this_is_a_lady Female
4 how_tall_is_this_boy Kid
5 girl_is_studying Kid
6 boy meets men Kid
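One thing to watch with this first-match approach is rows that contain none of the keywords, which come back as NaN. Below is a minimal sketch of a slightly more defensive variant. It assumes pandas' expand=False option (which returns a Series for a single capture group, so squeeze is not needed) and uses re.escape in case a keyword ever contains regex metacharacters. The extra no-match row and the 'Unknown' label are made up for illustration and are not part of the original data.

```python
import re
import pandas as pd

df = pd.DataFrame({'A': ['boy_was_born_in_2010',
                         'this_is_a_lady',
                         'boy meets men',
                         'nothing_to_see_here']})  # made-up row with no keyword

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}

# re.escape guards against keywords that contain regex metacharacters
pattern = '(' + '|'.join(re.escape(k) for k in a_dict) + ')'

# expand=False returns a Series for a single capture group, so no squeeze() is needed
first_match = df['A'].str.extract(pattern, expand=False)

# Rows without any keyword come back as NaN; 'Unknown' is a placeholder label (an assumption)
df['B'] = first_match.map(a_dict).fillna('Unknown')
print(df)
```

Skipping squeeze also sidesteps an edge case: on a one-row dataframe, squeeze collapses the result to a scalar, which has no map method.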
The str.extract approach above only returns the first match per row, e.g. boy -> Kid for the last string. If you want all matches, you can use something like this:
temp = df.A.str.extractall(pattern).squeeze().map(a_dict)
df['B'] = temp.groupby(level=0).agg(', '.join)
print(df)
A B
0 boy_was_born_in_2010 Kid
1 men_was_born_in_1997 Male
2 girl_this_is_2022 Kid
3 this_is_a_lady Female
4 how_tall_is_this_boy Kid
5 girl_is_studying Kid
6 boy meets men Kid, Male
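With the groupby version, a row holding two keywords that map to the same label (say boy_and_girl) would come out as Kid, Kid, and rows with no keyword end up as NaN. If that is not what you want, here is a rough alternative sketch using str.findall with a small helper. The helper name labels_for and the two extra rows are hypothetical, added only to show the edge cases.

```python
import pandas as pd

df = pd.DataFrame({'A': ['girl_is_studying',
                         'boy meets men',
                         'boy_and_girl',           # made-up row: two keywords, same label
                         'nothing_to_see_here']})  # made-up row: no keyword

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}
pattern = '|'.join(a_dict)  # no capture group needed for findall

def labels_for(matches):
    # dict.fromkeys keeps first-seen order while dropping duplicate labels
    return ', '.join(dict.fromkeys(a_dict[m] for m in matches))

# findall gives one list of matched keywords per row (empty list when nothing matches)
df['B'] = df['A'].str.findall(pattern).apply(labels_for)
print(df)
```

Here a row without any keyword gets an empty string rather than NaN, which may or may not suit your downstream code.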
CodePudding user response:
Assuming the first occurrence sets the category type, you can consider using the following solution based on numpy.select:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':["boy_was_born_in_2010","men_was_born_in_1997","girl_this_is_2022","this_is_a_lady","how_tall_is_this_boy","girl_is_studying"]})
condlist = [
df['A'].str.contains(r'(?<![^\W\d_])(?:boy|girl)s?(?![^\W\d_])', case=False, regex=True),
df['A'].str.contains(r'(?<![^\W\d_])m[ea]n(?![^\W\d_])', case=False, regex=True),
df['A'].str.contains(r'(?<![^\W\d_])lad(?:y|ies)(?![^\W\d_])', case=False, regex=True),
]
choicelist = ['Kid', 'Male', 'Female']
df['B'] = np.select(condlist, choicelist)
Output:
>>> df
A B
0 boy_was_born_in_2010 Kid
1 men_was_born_in_1997 Male
2 girl_this_is_2022 Kid
3 this_is_a_lady Female
4 how_tall_is_this_boy Kid
5 girl_is_studying Kid
Notes
(?<![^\W\d_])(?:boy|girl)s?(?![^\W\d_]) - matches boy/boys/girl/girls not enclosed with any letters (identified as Kid)
(?<![^\W\d_])m[ea]n(?![^\W\d_]) - matches man or men not enclosed with any letters (identified as Male)
(?<![^\W\d_])lad(?:y|ies)(?![^\W\d_]) - matches lady or ladies not enclosed with any letters (identified as Female).
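To see what "not enclosed with any letters" means in practice, here is a small sketch that checks the Kid pattern with the re module and then runs np.select on a few made-up strings. The sample rows and the 'Unknown' default are assumptions for illustration. Note that np.select takes the first condition in condlist that is true for a row, so the order of condlist decides the label when several patterns match. Passing an explicit string default also avoids depending on how NumPy combines the implicit default of 0 with string choices.

```python
import re
import numpy as np
import pandas as pd

kid_rx = r'(?<![^\W\d_])(?:boy|girl)s?(?![^\W\d_])'
male_rx = r'(?<![^\W\d_])m[ea]n(?![^\W\d_])'
female_rx = r'(?<![^\W\d_])lad(?:y|ies)(?![^\W\d_])'

kid = re.compile(kid_rx, re.IGNORECASE)
# Underscores, digits and string edges are not letters, so these match
assert kid.search('how_tall_is_this_boy')
assert kid.search('2girls1')
# Here the keyword is glued to other letters, so there is no match
assert kid.search('boyhood') is None
assert kid.search('cowboy') is None

# Made-up sample rows, not from the original question
df = pd.DataFrame({'A': ['cowboy_hat', 'two_ladies', 'a_man_walks', 'Boys_club']})
condlist = [
    df['A'].str.contains(kid_rx, case=False, regex=True),
    df['A'].str.contains(male_rx, case=False, regex=True),
    df['A'].str.contains(female_rx, case=False, regex=True),
]
choicelist = ['Kid', 'Male', 'Female']

# 'Unknown' is a placeholder for rows where no condition holds (an assumption)
df['B'] = np.select(condlist, choicelist, default='Unknown')
print(df)
```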