Home > database >  Compare and find duplicate
Compare and find duplicate

Time:10-19

I have a table:

Index Key Sequence Coordinate
1 1001 1 18.2, 14.2
2 1001 2 18.2, 14.2
3 1001 3 18.2, 14.2
4 2001 1 25.6, 22.8
5 2001 2 25.6, 22.8
6 2001 3 25.6, 22.8
7 5004 1 25.6, 22.8
8 5004 2 25.6, 22.8
9 5004 3 25.6, 22.8
10 6895 1 36.2, 17.4
11 6895 2 36.2, 17.4
12 6895 3 36.2, 17.4
13 6650 1 18.2, 14.2
14 6650 2 18.2, 14.2
15 6650 3 18.2, 14.2

From the table I need to find out the keys (different keys) that have duplicate longitude and latitude. If a key having any of sequence has duplicate longitude and latitude all the sequences of the same key should be shown duplicate. (comparison between same key sequences does not happen). The Matched_with column shows the index of the Key that were matched

The output table should be:

Index Key Sequence Coordinate Duplicate Matched_with
1 1001 1 18.2, 14.2 Yes 13,14,15
2 1001 2 18.2, 14.2 Yes 13,14,15
3 1001 3 18.2, 14.2 Yes 13,14,15
4 2001 1 25.6, 22.8 Yes 7,8,9
5 2001 2 25.6, 22.8 Yes 7,8,9
6 2001 3 25.6, 22.8 Yes 7,8,9
7 5004 1 25.6, 22.8 Yes 4,5,6
8 5004 2 25.6, 22.8 Yes 4,5,6
9 5004 3 25.6, 22.8 Yes 4,5,6
10 6895 1 36.2, 17.4 No
11 6895 2 36.2, 17.4 No
12 6895 3 36.2, 17.4 No
13 6650 1 18.2, 14.2 Yes 1,2,3
14 6650 2 18.2, 14.2 Yes 1,2,3
15 6650 3 18.2, 14.2 Yes 1,2,3

CodePudding user response:

You can use a selective merge combined with groupby.agg:

cols = ['Key', 'Coordinate', 'Index']
df['Matched_with'] = df['Index'].map(df[cols]
 .merge(df[cols]
        .astype({'Index': 'str'})
        .rename(columns={'Index': 'Matched_with'}),
        on='Coordinate')
 .loc[lambda d: d['Key_x'].ne(d['Key_y'])]
 .groupby('Index')['Matched_with'].agg(','.join)
)

df['Duplicate'] = np.where(df['Matched_with'].notna(), 'Yes', 'No')

output:

    Index   Key  Sequence  Coordinate Matched_with Duplicate
0       1  1001         1  18.2, 14.2     13,14,15       Yes
1       2  1001         2  18.2, 14.2     13,14,15       Yes
2       3  1001         3  18.2, 14.2     13,14,15       Yes
3       4  2001         1  25.6, 22.8        7,8,9       Yes
4       5  2001         2  25.6, 22.8        7,8,9       Yes
5       6  2001         3  25.6, 22.8        7,8,9       Yes
6       7  5004         1  25.6, 22.8        4,5,6       Yes
7       8  5004         2  25.6, 22.8        4,5,6       Yes
8       9  5004         3  25.6, 22.8        4,5,6       Yes
9      10  6895         1  36.2, 17.4          NaN        No
10     11  6895         2  36.2, 17.4          NaN        No
11     12  6895         3  36.2, 17.4          NaN        No
12     13  6650         1  18.2, 14.2        1,2,3       Yes
13     14  6650         2  18.2, 14.2        1,2,3       Yes
14     15  6650         3  18.2, 14.2        1,2,3       Yes
intermediate merge:
    Key_x  Coordinate  Index  Key_y Matched_with
0    1001  18.2, 14.2      1   1001            1  # these will be removed
1    1001  18.2, 14.2      1   1001            2  #
2    1001  18.2, 14.2      1   1001            3  #
3    1001  18.2, 14.2      1   6650           13      # these will be kept
4    1001  18.2, 14.2      1   6650           14      #
..    ...         ...    ...    ...          ...
76   6895  36.2, 17.4     11   6895           11
77   6895  36.2, 17.4     11   6895           12
78   6895  36.2, 17.4     12   6895           10
79   6895  36.2, 17.4     12   6895           11
80   6895  36.2, 17.4     12   6895           12

[81 rows x 5 columns]
  • Related