How to check similarity between two name pairs interchangeably and give them a unique identifier

I have a dataframe as follows,

import pandas as pd

df_names = pd.DataFrame({'last_name':['Williams','Henry','XYX','Smith','David','Freeman','Walter','Test_A'],
                        'first_name':['Henry','Williams','ABC','David','Smith','Walter','Freeman','Test_B']})

A new column, full name, is formed by joining the last and first names.

Here I would like to check how similar the full names are. Williams Henry and Henry Williams should be considered the same and given one unique identifier (some random code).

Similarly, Smith David and David Smith should also be considered the same and share one unique identifier.

Final expected output as below.

CodePudding user response：

Use:

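# Map each row to a frozenset of its two names so that order is ignored,
# then number the resulting groups starting from 1.
res = (df_names.assign(group=df_names[["last_name", "first_name"]].apply(frozenset, axis=1))
               .groupby("group")
               .ngroup() + 1)

df_names["unique_identifier"] = "A-" + res.astype("string")
print(df_names)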

Output

  last_name first_name unique_identifier
0  Williams      Henry               A-1
1     Henry   Williams               A-1
2       XYX        ABC               A-2
3     Smith      David               A-3
4     David      Smith               A-3
5   Freeman     Walter               A-4
6    Walter    Freeman               A-4
7    Test_A     Test_B               A-5

The idea is to use frozenset to map each row to an object in which the order of the elements is irrelevant. It has to be a frozenset rather than a plain set so that it is hashable, which pandas requires of group keys.
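To make the hashability point concrete, here is a minimal sketch in plain Python (no pandas involved). The variable names are only illustrative. It shows that two frozensets built from the same names in a different order compare equal and hash alike, while a plain set cannot be hashed at all.

```python
# Two frozensets built from the same names in a different order are equal
# and hash to the same value, so they can act as one group key.
a = frozenset(["Williams", "Henry"])
b = frozenset(["Henry", "Williams"])
same_key = (a == b) and (hash(a) == hash(b))

# A plain set is mutable and therefore unhashable.
try:
    hash({"Williams", "Henry"})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

print(same_key, set_is_hashable)  # True False
```

One thing to keep in mind: a frozenset collapses duplicates, so a row whose first and last names are identical becomes a one-element key. If that matters for your data, a sorted tuple keeps both entries.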

CodePudding user response：

Try this:

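# Group rows by their two names sorted into a tuple, so that order does not matter.
# groupby passes index labels to the function, hence .loc rather than .iloc.
a = df_names.groupby(lambda idx: tuple(sorted(df_names.loc[idx])))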

Output:

a.apply(print)

  last_name first_name
2       XYX        ABC
  last_name first_name
5   Freeman     Walter
6    Walter    Freeman
  last_name first_name
0  Williams      Henry
1     Henry   Williams
  last_name first_name
3     Smith      David
4     David      Smith
  last_name first_name
7    Test_A     Test_B
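The grouping above stops short of the identifier column the question asks for. As a rough sketch of one way to get there from a sorted key, the snippet below swaps in pd.factorize instead of the groupby used in this answer. The "|" separator and the "A-" prefix are assumptions chosen for illustration, and the separator assumes no name contains that character.

```python
import pandas as pd

df_names = pd.DataFrame({
    "last_name": ["Williams", "Henry", "XYX", "Smith", "David", "Freeman", "Walter", "Test_A"],
    "first_name": ["Henry", "Williams", "ABC", "David", "Smith", "Walter", "Freeman", "Test_B"],
})

# Build an order-insensitive key per row: the two names, sorted and joined.
# The "|" separator is an arbitrary choice (assumes names never contain it).
key = df_names[["last_name", "first_name"]].apply(
    lambda row: "|".join(sorted(row)), axis=1
)

# factorize numbers distinct keys 0, 1, 2, ... in order of first appearance.
codes, _ = pd.factorize(key)
df_names["unique_identifier"] = ["A-" + str(code + 1) for code in codes]
print(df_names)
```

On the sample data this gives A-1 to the Williams/Henry rows, A-2 to XYX/ABC, A-3 to Smith/David, A-4 to Freeman/Walter, and A-5 to Test_A/Test_B.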