Python regex removing duplicated names

I have a Series of names. If a name is repeated, I'd like to have only one.

John Smith
David BrownDavid Brown

I'd like to have output

John Smith
David Brown

I found ways to use r'\b(\w+)( \1\b)+' to catch the whitespace between repeated names and keep a single copy with r'\1'. However, in my case there is no whitespace between the repeats. Does that mean I need to compare the strings character by character to find duplicates? Is there a simpler way?

CodePudding user response：

You can make the word quantifiers non-greedy (with ?) so that the group captures a single copy of the name, then optionally match any repeats of it:

\b(\w+? \w+?)\1*\b

Check the test cases

You may also add another name section to support middle names such as:

\b(\w+? \w+?(?: \w+?)?)\1*\b
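
As a rough sketch of how the two patterns above behave in Python, assuming the quantifiers are the lazy `\w+?` and that each match is replaced with `\1`. The helper name `dedupe_name` and the sample names are made up for illustration:

```python
import re

# Patterns from this answer; "+?" is the lazy form of "+".
TWO_PART = re.compile(r"\b(\w+? \w+?)\1*\b")
WITH_MIDDLE = re.compile(r"\b(\w+? \w+?(?: \w+?)?)\1*\b")


def dedupe_name(text, pattern=TWO_PART):
    """Replace a name followed by repeats of itself with a single copy."""
    return pattern.sub(r"\1", text)


# Hypothetical sample data, not taken from the question's real Series.
names = ["John Smith", "David BrownDavid Brown", "Mary Ann LeeMary Ann Lee"]
two_part = [dedupe_name(n) for n in names[:2]]
with_middle = [dedupe_name(n, WITH_MIDDLE) for n in names]
print(two_part)
print(with_middle)
```

Because `\1*` also allows zero repeats, names that are not duplicated still match and are simply replaced by themselves.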

CodePudding user response：

You can use

\b(.+?)\1\b

See the regex demo. Details:

\b - a word boundary
(.+?) - Group 1: one or more chars other than line break chars, as few as possible
\1 - Same value as in Group 1
\b - a word boundary
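
The breakdown above can be tried out with the standard `re` module. This is a minimal sketch, assuming the pattern is `\b(.+?)\1\b` and the replacement is `\1`; the function name `collapse_doubled` is only illustrative:

```python
import re

# Group 1 lazily grows until it is immediately followed by a copy of itself.
DOUBLED = re.compile(r"\b(.+?)\1\b")


def collapse_doubled(text):
    """Keep one copy of any run that is doubled between word boundaries."""
    return DOUBLED.sub(r"\1", text)


samples = ["John Smith", "David BrownDavid Brown"]
cleaned = [collapse_doubled(s) for s in samples]
print(cleaned)
```

One thing to keep in mind: the pattern is not specific to names, so any word made of an exact doubled run is shortened too (for example, `collapse_doubled("mimi")` gives `"mi"`).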

In Pandas, you can use

df['column_name'] = df['column_name'].str.replace(r'\b(.+?)\1\b', r'\1', regex=True)
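
If every duplicated value is exactly the name written twice with nothing in between, a regex is not strictly needed. The following is a sketch of a plain string comparison under that assumption; `collapse_if_doubled` is a hypothetical helper, not part of pandas:

```python
def collapse_if_doubled(name):
    """Return the first half when the string is exactly that half repeated twice."""
    half, remainder = divmod(len(name), 2)
    if remainder == 0 and half > 0 and name[:half] == name[half:]:
        return name[:half]
    return name


print(collapse_if_doubled("David BrownDavid Brown"))
print(collapse_if_doubled("John Smith"))
```

On a Series this could be applied with something like df['column_name'].map(collapse_if_doubled). Unlike the regex, it only handles a value that is doubled as a whole, not a doubled name embedded in longer text.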