Home > Back-end >  How to check whether entry in one row appears in another row but order of column values changed?
How to check whether entry in one row appears in another row but order of column values changed?

Time:10-07

If I have some dataframe with Firs Name and Last Name columns, how can I check whether below case exist in dataframe and therefore extract all of the such cases:

First Name Last Name
Albert Einstein
Einstein Albert

?

In short, how can I check whether the criss cross cases exist in dataframe? The task at hand that I am trying to tackle is to locate all of the duplicates in terms of names in a dataframe. I am consider other cases (below), but they can be tackled separately.

  • Different last names, same first name
  • Same last name, different first name
  • Same last name, same first name

Thanks!

CodePudding user response:

You can create an ordered tuple of your columns:

data = {'First Name': ['John', 'John', 'Kennedy'],
        'Middle Name': ['Fitzgerald', 'Kennedy', 'John'],
        'Last Name': ['Kennedy', 'Fitzgerald', 'Fitzgerald']}
df = pd.DataFrame(data)

df['key'] = df.apply(lambda x: tuple(sorted(x)), axis=1)

Output result:

>>> df
  First Name Middle Name   Last Name                          key
0       John  Fitzgerald     Kennedy  (Fitzgerald, John, Kennedy)
1       John     Kennedy  Fitzgerald  (Fitzgerald, John, Kennedy)
2    Kennedy        John  Fitzgerald  (Fitzgerald, John, Kennedy)

Now you have a key to use duplicated or drop_duplicates or whatever you want.

CodePudding user response:

Doing the np.sort and check them with duplicated

df1 = df.copy()
df1[:] = np.sort(df1.values,1)
df1.duplicated(keep=False)
Out[144]: 
0    True
1    True
dtype: bool
  • Related