I have a Series of names. If a name is repeated, I'd like to keep only one copy.
John Smith
David BrownDavid Brown
I'd like the output to be
John Smith
David Brown
I found ways to use '\b(\w+)( \1\b)+' to catch the whitespace between repeated names and keep a single copy with r'\1'. However, in my case, there is no whitespace. Does that mean I need to compare strings character by character to find duplicates? Is there a simpler way?
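For reference, a minimal sketch of why the whitespace-based pattern does not help here. The sample strings and the variable names are only illustrative:

```python
import re

# Pattern quoted above: a word followed by one or more space-separated repeats of itself.
spaced_repeat = re.compile(r'\b(\w+)( \1\b)+')

# Works when the repeated word is separated by a space.
with_space = spaced_repeat.sub(r'\1', "hello hello world")

# Does nothing when the repeated name is glued on without a space.
without_space = spaced_repeat.sub(r'\1', "David BrownDavid Brown")

print(with_space)     # hello world
print(without_space)  # David BrownDavid Brown
```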
CodePudding user response:
You can use the non-greedy modifier (?) on the word quantifiers and then optionally match any repeated copies of the name:
\b(\w+? \w+?)\1*\b
Check the test cases
You may also add another optional name section to support middle names, such as:
\b(\w+? \w+?(?: \w+?)?)\1*\b
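A quick way to try both patterns is with Python's re module. This is only a sketch: the helper name dedupe and the middle-name sample are made up for illustration, and real data may contain names these patterns do not cover (hyphens, initials with dots, and so on).

```python
import re

# Lazily match "First Last", then any number of immediately repeated copies.
TWO_PART = re.compile(r'\b(\w+? \w+?)\1*\b')
# Same idea with an optional middle name.
WITH_MIDDLE = re.compile(r'\b(\w+? \w+?(?: \w+?)?)\1*\b')

def dedupe(name, pattern=TWO_PART):
    """Replace a name and its glued-on repeats with a single copy (hypothetical helper)."""
    return pattern.sub(r'\1', name)

samples = ["John Smith", "David BrownDavid Brown"]
results = [dedupe(s) for s in samples]
middle = dedupe("Mary Ann LeeMary Ann Lee", WITH_MIDDLE)

print(results)  # ['John Smith', 'David Brown']
print(middle)   # Mary Ann Lee
```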
CodePudding user response:
You can use
\b(.+?)\1\b
See the regex demo. Details:
\b - a word boundary
(.+?) - Group 1: one or more chars other than line break chars, as few as possible
\1 - the same value as in Group 1
\b - a word boundary
In Pandas, you can use
df['column_name'] = df['column_name'].str.replace(r'\b(.+?)\1\b', r'\1', regex=True)
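As a small end-to-end check, here is the same replacement applied to a Series built from the two names in the question. The variable names are arbitrary, and this sketch assumes the duplicate is an exact, immediately adjacent copy of the name:

```python
import pandas as pd

# Sample data taken from the question.
names = pd.Series(["John Smith", "David BrownDavid Brown"])

# Group 1 lazily captures the shortest text that is immediately repeated;
# the whole match is then replaced with that single copy.
cleaned = names.str.replace(r'\b(.+?)\1\b', r'\1', regex=True)

print(cleaned.tolist())  # ['John Smith', 'David Brown']
```

Because (.+?) is not limited to word characters, this version also handles names with more than two parts, as long as the repeat is identical.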