I need to filter out specific characters from strings in my dataset using regex and Python.
I have data that consists of business and employee information. I would like to keep the number of employees and leave unchanged any cell that contains the word 'Self-employed'. What regular expression can I use to remove everything except those values?
Example:
String: "Health, Wellness and Fitness, 501-1000 employees"
Desired Outcome: 501-1000
Or:
String: "Retail, 10,000 employees"
Desired Outcome: 10,000
Or, if the cell contains 'Self-employed', it should be kept as is before moving on to the next cell:
String: "Self-employed"
Desired Outcome: Self-employed
I would like one pattern that can eliminate everything except what is requested in the desired outcome. Here is the code that I use, but it does not seem to change anything. What am I doing wrong?
if 'employee' in row.keys():
    row['employee'] = re.sub("([0-9] [,\-]*[0-9]*[ ]?|Self-employed)", '', str(row['employee']))
CodePudding user response:
re.sub matches a regex pattern and replaces the match. You don't want to do that. You want the opposite: to match a pattern and keep the match. So re.sub doesn't seem like the right approach here. You could instead use re.search to find the part of the string that matches your regex, then assign the result to row['employee']. Here is an example based on the code you've provided so far:
import re

if 'employee' in row.keys():
    # Keep the first token that starts with a digit, or the literal 'Self-employed'.
    match = re.search(r"(\d[^\s]*|Self-employed)", str(row['employee']))
    if match:
        row['employee'] = match.group()
Credit to Porsche9II for the regex optimization.
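To check the pattern against the three sample strings from the question, here is a minimal self-contained sketch. The helper name extract_employee_count and the samples list are made up for illustration; the regex is the same one used above. It also runs re.sub with the same pattern on one sample to show that re.sub deletes the part you wanted to keep.

```python
import re

# Same pattern as in the answer: a token starting with a digit, or 'Self-employed'.
EMPLOYEE_PATTERN = re.compile(r"(\d[^\s]*|Self-employed)")


def extract_employee_count(text):
    """Hypothetical helper: return the matched token, or the input unchanged if nothing matches."""
    match = EMPLOYEE_PATTERN.search(str(text))
    return match.group() if match else text


# Sample strings taken from the question.
samples = [
    "Health, Wellness and Fitness, 501-1000 employees",
    "Retail, 10,000 employees",
    "Self-employed",
]

results = [extract_employee_count(s) for s in samples]
print(results)

# For contrast: re.sub with the same pattern removes the match instead of keeping it.
removed = EMPLOYEE_PATTERN.sub("", samples[1])
print(repr(removed))
```

Note that re.search returns the first match only, so a cell whose industry name contains a digit before the employee count (for example "3D Printing, 11-50 employees") would yield "3D" with this pattern. If your data has such cells, the pattern would need to be tightened.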