If I have some dataframe with Firs Name and Last Name columns, how can I check whether below case exist in dataframe and therefore extract all of the such cases:
First Name | Last Name |
---|---|
Albert | Einstein |
Einstein | Albert |
?
In short, how can I check whether the criss cross cases exist in dataframe? The task at hand that I am trying to tackle is to locate all of the duplicates in terms of names in a dataframe. I am consider other cases (below), but they can be tackled separately.
- Different last names, same first name
- Same last name, different first name
- Same last name, same first name
Thanks!
CodePudding user response:
You can create an ordered tuple of your columns:
data = {'First Name': ['John', 'John', 'Kennedy'],
'Middle Name': ['Fitzgerald', 'Kennedy', 'John'],
'Last Name': ['Kennedy', 'Fitzgerald', 'Fitzgerald']}
df = pd.DataFrame(data)
df['key'] = df.apply(lambda x: tuple(sorted(x)), axis=1)
Output result:
>>> df
First Name Middle Name Last Name key
0 John Fitzgerald Kennedy (Fitzgerald, John, Kennedy)
1 John Kennedy Fitzgerald (Fitzgerald, John, Kennedy)
2 Kennedy John Fitzgerald (Fitzgerald, John, Kennedy)
Now you have a key to use duplicated
or drop_duplicates
or whatever you want.
CodePudding user response:
Doing the np.sort
and check them with duplicated
df1 = df.copy()
df1[:] = np.sort(df1.values,1)
df1.duplicated(keep=False)
Out[144]:
0 True
1 True
dtype: bool