Creating a new column using regex if certain keywords are found in another column's values


I have a dataframe (df) with a string column called A:

 index   A
  0      boy_was_born_in_2010
  1      men_was_born_in_1997
  2      girl_this_is_2022
  3      this_is_a_lady
  4      how_tall_is_this_boy
  5      girl_is_studying

I wrote code that looks for specific words (boy, girl, men, lady) in column A and, when a keyword is found, fills a new column B accordingly: if the string in A contains boy, B should be Kid; likewise men -> Male, girl -> Kid, lady -> Female. The expected result is:

 index   A                         B
  0      boy_was_born_in_2010      Kid
  1      men_was_born_in_1997      Male
  2      girl_this_is_2022         Kid
  3      this_is_a_lady            Female
  4      how_tall_is_this_boy      Kid
  5      girl_is_studying          Kid

I have used the following code


This works, but when I apply it for another keyword, the values set previously (as above) somehow get undone. I also don't think this is an optimized way to do it.

I also tried this:

df['B'] = df['A'].replace(to_replace ='^boy\W', value = 'Kid', regex = True) # not working

If you have any other way of doing this, please suggest it. I am a complete beginner and have just started working on NLP projects.

CodePudding user response:

Adding an example with multiple matches:

import pandas as pd

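data = {'A': ['boy_was_born_in_2010',
              'men_was_born_in_1997',
              'girl_this_is_2022',
              'this_is_a_lady',
              'how_tall_is_this_boy',
              'girl_is_studying',
              'boy meets men']}

df = pd.DataFrame(data)
print(df)

                      A
0  boy_was_born_in_2010
1  men_was_born_in_1997
2     girl_this_is_2022
3        this_is_a_lady
4  how_tall_is_this_boy
5      girl_is_studying
6         boy meets men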

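Set up a dict for the mapping, use str.extract on a pattern with alternatives, squeeze the result, and finally apply map:

a_dict = {'boy': 'Kid',
          'men': 'Male',
          'girl': 'Kid',
          'lady': 'Female'}

# Build one capture group with all keywords as alternatives: (boy|men|girl|lady)
pattern = '(' + '|'.join(a_dict) + ')'
df['B'] = df.A.str.extract(pattern).squeeze().map(a_dict)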


                      A       B
0  boy_was_born_in_2010     Kid
1  men_was_born_in_1997    Male
2     girl_this_is_2022     Kid
3        this_is_a_lady  Female
4  how_tall_is_this_boy     Kid
5      girl_is_studying     Kid
6         boy meets men     Kid

The above method only gets you the first match, e.g. boy -> Kid for the last string. If you want all matches, you can use something like this:

temp = df.A.str.extractall(pattern).squeeze().map(a_dict)
df['B'] = temp.groupby(level=0).agg(', '.join)


                      A          B
0  boy_was_born_in_2010        Kid
1  men_was_born_in_1997       Male
2     girl_this_is_2022        Kid
3        this_is_a_lady     Female
4  how_tall_is_this_boy        Kid
5      girl_is_studying        Kid
6         boy meets men  Kid, Male
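
One thing the examples above do not cover is a row that contains none of the keywords. Below is a minimal sketch of how that case could be handled, under a few assumptions of mine: the sample rows, the 'Unknown' label and the column names B_first and B_all are hypothetical, and re.escape is only added as a precaution in case a keyword ever contains regex metacharacters. It also selects column 0 of the extractall result instead of calling squeeze(), since squeeze() collapses to a scalar when there is only a single match in total.

```python
import re
import pandas as pd

# Hypothetical sample: the last row contains none of the keywords.
df = pd.DataFrame({'A': ['boy_was_born_in_2010',
                         'boy meets men',
                         'nothing_to_see_here']})

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}

# Escape each keyword so it is always treated as literal text.
pattern = '(' + '|'.join(re.escape(k) for k in a_dict) + ')'

# First match only; expand=False returns a Series directly.
first = df['A'].str.extract(pattern, expand=False).map(a_dict)
df['B_first'] = first.fillna('Unknown')

# All matches; column 0 holds the single capture group.
all_matches = df['A'].str.extractall(pattern)[0].map(a_dict)
joined = all_matches.groupby(level=0).agg(', '.join)
# Rows without any match are missing from `joined`, so reindex before filling.
df['B_all'] = joined.reindex(df.index).fillna('Unknown')

print(df)
```

Without the fillna calls, the unmatched row would simply end up as NaN in both columns, which may be what you want anyway.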

CodePudding user response:

Assuming the first occurrence sets the category type, you can consider using the following solution based on numpy.select:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':["boy_was_born_in_2010","men_was_born_in_1997","girl_this_is_2022","this_is_a_lady","how_tall_is_this_boy","girl_is_studying"]})

condlist = [
    df['A'].str.contains(r'(?<![^\W\d_])(?:boy|girl)s?(?![^\W\d_])', case=False, regex=True),
    df['A'].str.contains(r'(?<![^\W\d_])m[ea]n(?![^\W\d_])', case=False, regex=True),
    df['A'].str.contains(r'(?<![^\W\d_])lad(?:y|ies)(?![^\W\d_])', case=False, regex=True),
]

choicelist = ['Kid', 'Male', 'Female']
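# Explicit string default: the implicit default 0 is an integer and does not share a dtype with the string choices
df['B'] = np.select(condlist, choicelist, default='')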


>>> df
                      A       B
0  boy_was_born_in_2010     Kid
1  men_was_born_in_1997    Male
2     girl_this_is_2022     Kid
3        this_is_a_lady  Female
4  how_tall_is_this_boy     Kid
5      girl_is_studying     Kid
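
To illustrate the "first occurrence sets the category" assumption: np.select walks the conditions in list order and takes the choice of the first one that is True, so the order of condlist decides ties. Here is a small sketch with made-up rows and deliberately simplified substring patterns (not the lookaround patterns above); the 'Unknown' default is my own placeholder for rows where nothing matches.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one with two keywords, one with no keyword at all.
df = pd.DataFrame({'A': ['boy meets men', 'this_is_a_lady', 'no_keyword_here']})

# Simplified patterns for illustration only (plain substring alternatives).
condlist = [
    df['A'].str.contains('boy|girl'),
    df['A'].str.contains('men'),
    df['A'].str.contains('lady'),
]
choicelist = ['Kid', 'Male', 'Female']

# The first True condition wins; rows with no True condition get the default.
df['B'] = np.select(condlist, choicelist, default='Unknown')
print(df)
```

The first row matches both the Kid and the Male condition and ends up as Kid because that condition comes first in the list. Swap the first two entries of condlist and choicelist and it becomes Male.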


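  • (?<![^\W\d_])(?:boy|girl)s?(?![^\W\d_]) - matches boy / boys / girl / girls when not immediately preceded or followed by a letter (identified as Kid). [^\W\d_] matches any letter, so digits and underscores next to the word are allowed.
  • (?<![^\W\d_])m[ea]n(?![^\W\d_]) - matches man or men when not immediately preceded or followed by a letter (identified as Male)
  • (?<![^\W\d_])lad(?:y|ies)(?![^\W\d_]) - matches lady or ladies when not immediately preceded or followed by a letter (identified as Female).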
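
The reason for these lookarounds rather than a plain \b is that the underscore counts as a word character in regex, so \bboy\b does not match in boy_was_born. For the same reason \W cannot match the underscore in the ^boy\W attempt from the question. A quick check with the standard re module (the sample strings beyond those in the question are made up for illustration):

```python
import re

# Letter-only boundaries, as used in the patterns above.
letter_boundary = re.compile(r'(?<![^\W\d_])boy(?![^\W\d_])')
# Ordinary word boundaries, for comparison.
word_boundary = re.compile(r'\bboy\b')

samples = ['boy_was_born_in_2010',
           'how_tall_is_this_boy',
           'boycott_2010',       # hypothetical: "boy" followed by a letter
           'boy meets men']

results = [(bool(letter_boundary.search(s)), bool(word_boundary.search(s)))
           for s in samples]
for s, (with_lookarounds, with_b) in zip(samples, results):
    print(f'{s:22} lookarounds={with_lookarounds}  \\b={with_b}')

# The pattern tried in the question: \W does not match "_", so nothing is found.
question_attempt = re.search(r'^boy\W', 'boy_was_born_in_2010')
print(question_attempt)
```

Only the lookaround version finds boy in the underscore-separated strings, and both versions correctly skip boycott.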