Home > Net >  Avoid Duplicates Pandas Merge Between a Dataframe and its Copy
Avoid Duplicates Pandas Merge Between a Dataframe and its Copy

Time:09-30

I need to find matches in a column of the same dataframe, that I'm doing is copying the dataframe and making a merge between de dataframe and its copy, but there is a way to avoid duplicates when 2 columns are equal and when the same result is shown previously. For Example:

df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']

df2 = df1

df3 = pd.merge(df1,df2,on='Tel',how='inner')

The result is the following:

    Id_x Tel Id_y
0   001 123 001
1   001 123 004
2   001 123 006
3   004 123 001
4   004 123 004
5   004 123 006
6   006 123 001
7   006 123 004
8   006 123 006
9   002 456 002
10  003 789 003
11  005 852 005

But I want the following result:

    Id_x Tel Id_y
0   001 123 004
1   001 123 006
2   004 123 006

As you can see I need to ignore the results when Id_x == Id_y, but also I need to ignore when the same result is shown previously in different order, for example in the first result index 1 is the same result of the index 3, index 2 is the same result of index 6 and index 5 is the same result of index 7. So in the final result I just want index 1, index 2 and index 5.

Is there a way to do that?

Thank you so much!

CodePudding user response:

Sort of a roundabout solution but this should get you the answer you are looking for

df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']

df1 = df1.drop_duplicates()
df2 = df1
df3 = pd.merge(df1,df2,on='Tel',how='inner')
df3 = df3[df3['Id_x'] != df3['Id_y']]

CodePudding user response:

Create tuples from Id_x and Id_y then sorted them and drop duplicates:

>>> df3[df3[['Id_x', 'Id_y']].apply(lambda x: sorted(tuple(x)), axis=1) 
                             .duplicated(keep='last')]

  Id_x  Tel Id_y
1  001  123  004
2  001  123  006
5  004  123  006
  • Related