How to filter out specific characters using regex in Python

I need a regex to filter out specific characters from strings in my dataset. How can I remove the numerical digits, and the "-" symbol when it is between numbers, while leaving the "-" symbol alone when it is between alphabetical characters? The regular expression I use now removes every "-" between any alphanumeric characters, not only the ones between digits.

Example:

Problem: "Non-Profit Organization management, 100-200 employees"
Current outcome: "NonProfit Organization management, employees"
Desired outcome: "Non-Profit Organization management, employees"

if 'business' in row.keys():
    row['business'] = re.sub("[0-9-][0-9]*", '', str(row['business']))

CodePudding user response：

In Python:

import re

string = "Non-Profit Organization management, 100-200 employees"
# \d+ matches one or more digits, so only a "-" between two numbers is removed
re.sub(r"(\d+)-(\d+)", "", string)

Output:

'Non-Profit Organization management,  employees'

CodePudding user response：

You need to use the expression \d+-\d+ to replace every - that sits between digits (\d), together with those digits, with an empty string.

print(re.sub(r"\d+-\d+ *", "", "Non-Profit Organization management, 100-200 employees"))

Results in "Non-Profit Organization management, employees"

Note that I added " *" (a space followed by *) to the end of the pattern in order to remove the spaces after the number, too.
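To see which hyphens this kind of pattern touches, here is a minimal sketch that runs it over a few strings. Only the first sample comes from the question; the other samples and the helper name strip_ranges are made up for illustration.

```python
import re

def strip_ranges(text):
    # \d+  one or more digits
    # -    a literal hyphen
    # \d+  one or more digits
    #  *   zero or more trailing spaces
    return re.sub(r"\d+-\d+ *", "", text)

samples = [
    "Non-Profit Organization management, 100-200 employees",
    "E-commerce, 10-50 employees",  # hypothetical sample
    "Self-employed",                # hypothetical sample
]
for s in samples:
    print(strip_ranges(s))
```

A hyphen between letters never matches, because the pattern requires digits on both sides of it. For the same reason a standalone number such as "50 employees" is left untouched; if those should go too, the pattern would need to be extended.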

Suggestion: If you perform this operation several times, I would suggest compiling the pattern first:

import re
pattern = re.compile(r"\d+-\d+ *")
print(pattern.sub("", "Non-Profit Organization management, 100-200 employees"))

That way the pattern is compiled once up front and the compiled object is reused on every call.
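Applied to the row loop from the question, a precompiled pattern might look like the sketch below. It assumes each row is a dict, as the row['business'] access in the question suggests; the names RANGE_PATTERN and clean_rows and the second sample row are hypothetical.

```python
import re

# Compiled once at import time, then reused for every row.
RANGE_PATTERN = re.compile(r"\d+-\d+ *")

def clean_rows(rows):
    """Strip digit ranges such as '100-200 ' from each row's 'business' field, in place."""
    for row in rows:
        if 'business' in row:
            row['business'] = RANGE_PATTERN.sub("", str(row['business']))
    return rows

rows = [
    {"business": "Non-Profit Organization management, 100-200 employees"},
    {"name": "row without a business field"},  # hypothetical sample, left as-is
]
print(clean_rows(rows))
```

Rows without a 'business' key are skipped, matching the check in the original snippet.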