I'm trying to remove duplicate words in a cell to clean up data dealing with first and last names (husband and wife).
Current Desired
0 John Doe and Jane Doe John Doe and Jane doe
1 John Doe and John Doe John Doe
2 John Doe John Doe
3 Jane Doe and Jane Doe Jane Doe
4 Jane Doe and Jane Jane Doe
5 John and John Doe John Doe
I have the following that works well with cleaning up John and John
to John
:
df['out'] = df.Current.str.split(' and ').map(lambda x : ' and '.join(set(x)))
How can I adjust my code to account for other cases to clean up first and last names?
CodePudding user response:
Try:
import re
pat = re.compile(r"(. ?)\s and\s (. )")
def clean(x):
m = pat.match(x)
if not m:
return x
if m.group(1).split()[0] == m.group(2).split()[0]:
return max(m.group(1), m.group(2), key=len)
return x
df["Desired 2"] = df["Current"].apply(clean)
print(df)
Prints:
Current Desired Desired 2
0 John Doe and Jane Doe John Doe and Jane doe John Doe and Jane Doe
1 John Doe and John Doe John Doe John Doe
2 John Doe John Doe John Doe
3 Jane Doe and Jane Doe Jane Doe Jane Doe
4 Jane Doe and Jane Jane Doe Jane Doe
5 John and John Doe John Doe John Doe