I need a regex to filter specific characters out of the strings in my dataset. How can I remove the digits, and the "-" symbol only when it sits between digits, while leaving the "-" alone when it sits between letters? The regular expression I use now removes every "-" between any alphanumeric characters, not just the ones between digits.
Example:
Problem: "Non-Profit Organization management, 100-200 employees"
Current outcome: "NonProfit Organization management, employees"
Desired outcome: "Non-Profit Organization management, employees"
Current code:
if 'business' in row:
    row['business'] = re.sub("[0-9-][0-9]*", '', str(row['business']))
CodePudding user response:
In Python:
import re
string = "Non-Profit Organization management, 100-200 employees"
re.sub(r"(\d+)-(\d+)", "", string)
Output (the space that followed the number is kept, so two spaces remain after the comma):
'Non-Profit Organization management,  employees'
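To see why the pattern from the question also strips the hyphen in "Non-Profit", here is a small side-by-side sketch. The variable names are only illustrative. The character class [0-9-] accepts a lone "-" as its single required character, whereas \d+-\d+ requires at least one digit on each side of the "-".

```python
import re

text = "Non-Profit Organization management, 100-200 employees"

# Pattern from the question: [0-9-] matches a digit OR a "-", and the
# following [0-9]* may match nothing, so a bare "-" is a full match.
question_pattern = re.sub(r"[0-9-][0-9]*", "", text)

# Pattern from this answer: "-" only matches with digits on both sides.
digits_only = re.sub(r"\d+-\d+", "", text)

print(repr(question_pattern))
print(repr(digits_only))
```

Printing with repr() makes the leftover whitespace visible, which is easy to miss otherwise.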
CodePudding user response:
You need to use the expression \d+-\d+ so that a "-" is matched only together with the digits (\d) on both sides of it, and then replace each match with an empty string.
print(re.sub(r"\d+-\d+ *", "", "Non-Profit Organization management, 100-200 employees"))
Results in "Non-Profit Organization management, employees"
Note that I added " *" (a space followed by *) to the end of the pattern in order to remove the spaces after the number, too.
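Suggestion: If you perform this operation many times, I would suggest compiling the pattern once and reusing it:
import re
pattern = re.compile(r"\d+-\d+ *")
print(pattern.sub("", "Non-Profit Organization management, 100-200 employees"))
This way the pattern is compiled once up front. Python's re module also caches recently used patterns, so the gain is mostly a skipped cache lookup and a reusable, named pattern.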
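Applied to the dataset loop from the question, the compiled pattern could be used roughly like this. This is only a sketch under a few assumptions: the rows are plain dicts, the helper strip_numbers and the sample rows are made up for illustration, and the pattern is widened to \d+(?:-\d+)* so that standalone numbers are removed as well, since the question asks to filter out digits in general.

```python
import re

# One or more digits, optionally followed by "-digits" groups, plus any
# trailing spaces. A "-" between letters never matches, because the
# match must start with a digit.
NUMBER_OR_RANGE = re.compile(r"\d+(?:-\d+)* *")


def strip_numbers(row, key="business"):
    """Return a copy of row with numbers and numeric ranges removed from row[key]."""
    cleaned = dict(row)
    if key in cleaned:
        cleaned[key] = NUMBER_OR_RANGE.sub("", str(cleaned[key]))
    return cleaned


# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {"business": "Non-Profit Organization management, 100-200 employees"},
    {"business": "E-commerce, 50 employees"},
    {"name": "row without a business field"},
]

cleaned_rows = [strip_numbers(row) for row in rows]
for row in cleaned_rows:
    print(row)
```

Returning a copy keeps the original rows untouched. If in-place modification is preferred, as in the question, assign to row[key] directly instead.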