Regex - removing everything after first word following a comma

Time:02-06

I have a column with name variations that I'd like to clean up. I'm having trouble writing a regex that removes everything after the first word following a comma.

import re
import pandas as pd
d = {'names':['smith,john s','smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)

Tried:
x['names'] =  [re.sub(r'/.\s [^\s,] /','', str(x)) for x in x['names']]

Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']

Not sure why my regex isn't working, but any help would be appreciated.

CodePudding user response:

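Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)). Note that Python's re takes the bare pattern (no /.../ delimiters) and writes backreferences in the replacement as \1, not $1.

Put the part you want to keep (the comma, any whitespace, and the first word) into capture group 1, then restore it in the replacement so only the trailing text is dropped.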
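
As a minimal sketch of the capture-group approach described above, using only the standard re module on the sample names from the question (the variable names `pattern` and `cleaned` are illustrative, not from the original post):

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# Group 1 holds the comma, optional whitespace, and the first word after it.
# The trailing .*$ consumes the rest of the string, which is then discarded
# because the replacement puts back only group 1.
pattern = re.compile(r'(,\s*\w+).*$')
cleaned = [pattern.sub(r'\1', name) for name in names]

print(cleaned)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

Names that already end after the first word are matched too, but the replacement leaves them unchanged.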

CodePudding user response:

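You could use a regex that captures everything up to the comma, any whitespace after it, and the first word that follows, then replaces the whole string with just that captured part. Pass regex=True explicitly, since pandas 2.0 and later treat the pattern as a literal string by default:

x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)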

0     smith,john
1    smith, john
2     brown, bob
3     brown, bob
Name: names, dtype: object
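
For completeness, here is a self-contained sketch of the pandas approach that writes the cleaned values back to the column. It assumes pandas is installed and rebuilds the DataFrame from the question's sample data:

```python
import pandas as pd

x = pd.DataFrame({'names': ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']})

# regex=True is needed on pandas 2.0+, where str.replace defaults to
# literal (non-regex) matching.
x['names'] = x['names'].str.replace(r'^([^,]*,\s*[^\s]*).*', r'\1', regex=True)

print(x['names'].tolist())
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

Because the pattern requires a comma, any row without one does not match and is left as it was.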