Home > Back-end >  Python Pandas - Replacing cell value matching regular expression
Python Pandas - Replacing cell value matching regular expression

Time:09-05

New to Python and regex - I am trying to write a code to replace values in a column of my data frame which match a particular pattern.

Some sample values in my data frame are: Self- Employed, Self-Employeed,Self -employed,Retired (Self-emplyed)

The pattern I have is - If the value contains words starting with Self- or Self - or Self (space) [Ignore the case of self] then replace the entire value with "Self Employed"

Below is what i have tried -

data = pd.read_csv('loan.csv',keep_default_na=False)
data['emp_title']=data['emp_title'].replace('^Self','Self Employed',regex=True)

There are 2 problems -

  1. The regex seems incorrect. It is finding all the words starting with self but not contains. And it is taking into consideration the case of the word.

  2. Replace function seems to be replacing only the word "Self" and not the full value. For eg: If the value in my cell is "Self employed", python seems to replace only Self and makes the value "Self Employed employed"

Any help is appreciated.

CodePudding user response:

You may change to this code:

data = pd.read_csv('loan.csv',keep_default_na=False)
data['emp_title']=data['emp_title'].replace(r'(?i)\bself\b(?:\s*-\s*employed)?(?!\s Employed\b)', 'Self Employed', regex=True)

RegEx Demo

RegEx Breakup:

  • r'...': Raw string
  • (?i): Ignore case mode
  • \bSelf\b: Match complete word Self (case ignore)
  • (?:\s*-\s*employed)?: An optional match to match -employed
  • (?!\s Employed\b): Negative lookahead to assert that Self is not followed by 1 whitespace and word Employed
  • Related