I need to find matches in a column of the same dataframe. What I'm doing is copying the dataframe and merging the dataframe with its copy, but is there a way to avoid duplicates when the two Id columns are equal, and when the same pair has already appeared in an earlier row? For example:
import pandas as pd

df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']
df2 = df1
df3 = pd.merge(df1,df2,on='Tel',how='inner')
The result is the following:
Id_x Tel Id_y
0 001 123 001
1 001 123 004
2 001 123 006
3 004 123 001
4 004 123 004
5 004 123 006
6 006 123 001
7 006 123 004
8 006 123 006
9 002 456 002
10 003 789 003
11 005 852 005
But I want the following result:
Id_x Tel Id_y
0 001 123 004
1 001 123 006
2 004 123 006
As you can see, I need to ignore the rows where Id_x == Id_y, but I also need to ignore a pair that already appeared earlier in the opposite order: in the first result, index 1 is the same pair as index 3, index 2 is the same as index 6, and index 5 is the same as index 7. So in the final result I just want indexes 1, 2 and 5.
Is there a way to do that?
Thank you so much!
CodePudding user response:
Sort of a roundabout solution, but this should get you close to the answer you are looking for: it removes the rows where an Id is matched with itself (the mirrored duplicates still need one more step, sketched after the code).
df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']
df1 = df1.drop_duplicates()
df2 = df1
df3 = pd.merge(df1,df2,on='Tel',how='inner')
df3 = df3[df3['Id_x'] != df3['Id_y']]  # drop rows that pair an Id with itself
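That leaves the mirrored duplicates (001/123/004 vs 004/123/001) in place. A minimal way to finish the job, as a sketch that continues from the df3 built above rather than part of the original answer, is to keep only the ordering where Id_x sorts before Id_y:
# Each surviving pair appears in both orders, so keeping Id_x < Id_y
# retains exactly one row per pair (string comparison is fine because
# the Ids are zero-padded to the same length).
df3 = df3[df3['Id_x'] < df3['Id_y']]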
CodePudding user response:
Create tuples from Id_x and Id_y, sort them, then use duplicated to keep one row per pair. This runs on the question's original df3: with keep='last', every occurrence of a sorted pair except the last one is flagged, so selecting the flagged rows keeps each mirrored pair exactly once and drops the Id_x == Id_y rows, which occur only once:
>>> df3[df3[['Id_x', 'Id_y']].apply(lambda x: tuple(sorted(x)), axis=1)
...         .duplicated(keep='last')]
Id_x Tel Id_y
1 001 123 004
2 001 123 006
5 004 123 006
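The same selection can be done without a row-wise apply by sorting the two Id columns with NumPy first; a minimal sketch, assuming df3 is still the full merge from the question:
import numpy as np

# Sort each (Id_x, Id_y) pair so mirrored rows look identical, then reuse the
# keep='last' trick to select one row per pair.
sorted_pairs = pd.DataFrame(np.sort(df3[['Id_x', 'Id_y']].to_numpy(), axis=1),
                            index=df3.index)
df3[sorted_pairs.duplicated(keep='last')]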