Home > Net >  In python, clean up duplicate first and last name values in a dataframe
In python, clean up duplicate first and last name values in a dataframe

Time:07-22

I'm trying to remove duplicate words in a cell to clean up data dealing with first and last names (husband and wife).

          Current                  Desired
0  John Doe and Jane Doe         John Doe and Jane doe
1  John Doe and John Doe         John Doe
2  John Doe                      John Doe
3  Jane Doe and Jane Doe         Jane Doe
4  Jane Doe and Jane             Jane Doe
5  John and John Doe             John Doe

I have the following that works well with cleaning up John and John to John:

df['out'] = df.Current.str.split(' and ').map(lambda x : ' and '.join(set(x)))

How can I adjust my code to account for other cases to clean up first and last names?

CodePudding user response:

Try:

import re

pat = re.compile(r"(. ?)\s and\s (. )")


def clean(x):
    m = pat.match(x)

    if not m:
        return x

    if m.group(1).split()[0] == m.group(2).split()[0]:
        return max(m.group(1), m.group(2), key=len)

    return x


df["Desired 2"] = df["Current"].apply(clean)
print(df)

Prints:

                 Current                Desired              Desired 2
0  John Doe and Jane Doe  John Doe and Jane doe  John Doe and Jane Doe
1  John Doe and John Doe               John Doe               John Doe
2               John Doe               John Doe               John Doe
3  Jane Doe and Jane Doe               Jane Doe               Jane Doe
4      Jane Doe and Jane               Jane Doe               Jane Doe
5      John and John Doe               John Doe               John Doe
  • Related