I have a table:
Index | Key | Sequence | Coordinate |
---|---|---|---|
1 | 1001 | 1 | 18.2, 14.2 |
2 | 1001 | 2 | 18.2, 14.2 |
3 | 1001 | 3 | 18.2, 14.2 |
4 | 2001 | 1 | 25.6, 22.8 |
5 | 2001 | 2 | 25.6, 22.8 |
6 | 2001 | 3 | 25.6, 22.8 |
7 | 5004 | 1 | 25.6, 22.8 |
8 | 5004 | 2 | 25.6, 22.8 |
9 | 5004 | 3 | 25.6, 22.8 |
10 | 6895 | 1 | 36.2, 17.4 |
11 | 6895 | 2 | 36.2, 17.4 |
12 | 6895 | 3 | 36.2, 17.4 |
13 | 6650 | 1 | 18.2, 14.2 |
14 | 6650 | 2 | 18.2, 14.2 |
15 | 6650 | 3 | 18.2, 14.2 |
From the table I need to find the keys that share the same coordinate with a *different* key. If any sequence of a key has a duplicate coordinate, then all sequences of that key should be marked as duplicates (sequences belonging to the same key are never compared with each other). The Matched_with column lists the Index values of the rows whose Key was matched.
The output table should be:
Index | Key | Sequence | Coordinate | Duplicate | Matched_with |
---|---|---|---|---|---|
1 | 1001 | 1 | 18.2, 14.2 | Yes | 13,14,15 |
2 | 1001 | 2 | 18.2, 14.2 | Yes | 13,14,15 |
3 | 1001 | 3 | 18.2, 14.2 | Yes | 13,14,15 |
4 | 2001 | 1 | 25.6, 22.8 | Yes | 7,8,9 |
5 | 2001 | 2 | 25.6, 22.8 | Yes | 7,8,9 |
6 | 2001 | 3 | 25.6, 22.8 | Yes | 7,8,9 |
7 | 5004 | 1 | 25.6, 22.8 | Yes | 4,5,6 |
8 | 5004 | 2 | 25.6, 22.8 | Yes | 4,5,6 |
9 | 5004 | 3 | 25.6, 22.8 | Yes | 4,5,6 |
10 | 6895 | 1 | 36.2, 17.4 | No | |
11 | 6895 | 2 | 36.2, 17.4 | No | |
12 | 6895 | 3 | 36.2, 17.4 | No | |
13 | 6650 | 1 | 18.2, 14.2 | Yes | 1,2,3 |
14 | 6650 | 2 | 18.2, 14.2 | Yes | 1,2,3 |
15 | 6650 | 3 | 18.2, 14.2 | Yes | 1,2,3 |
CodePudding user response:
You can use a selective merge combined with `groupby.agg`:
```python
import numpy as np

cols = ['Key', 'Coordinate', 'Index']

# self-merge on Coordinate, keep only matches between different Keys,
# then aggregate the matching Index values per original row
df['Matched_with'] = df['Index'].map(
    df[cols]
    .merge(df[cols]
           .astype({'Index': 'str'})
           .rename(columns={'Index': 'Matched_with'}),
           on='Coordinate')
    .loc[lambda d: d['Key_x'].ne(d['Key_y'])]
    .groupby('Index')['Matched_with'].agg(','.join)
)
df['Duplicate'] = np.where(df['Matched_with'].notna(), 'Yes', 'No')
```
output:
```
    Index   Key  Sequence  Coordinate Matched_with Duplicate
0       1  1001         1  18.2, 14.2     13,14,15       Yes
1       2  1001         2  18.2, 14.2     13,14,15       Yes
2       3  1001         3  18.2, 14.2     13,14,15       Yes
3       4  2001         1  25.6, 22.8        7,8,9       Yes
4       5  2001         2  25.6, 22.8        7,8,9       Yes
5       6  2001         3  25.6, 22.8        7,8,9       Yes
6       7  5004         1  25.6, 22.8        4,5,6       Yes
7       8  5004         2  25.6, 22.8        4,5,6       Yes
8       9  5004         3  25.6, 22.8        4,5,6       Yes
9      10  6895         1  36.2, 17.4          NaN        No
10     11  6895         2  36.2, 17.4          NaN        No
11     12  6895         3  36.2, 17.4          NaN        No
12     13  6650         1  18.2, 14.2        1,2,3       Yes
13     14  6650         2  18.2, 14.2        1,2,3       Yes
14     15  6650         3  18.2, 14.2        1,2,3       Yes
```
Intermediate merge:
```
   Key_x  Coordinate  Index Key_y Matched_with
0   1001  18.2, 14.2      1  1001            1   # these will be removed
1   1001  18.2, 14.2      1  1001            2   #
2   1001  18.2, 14.2      1  1001            3   #
3   1001  18.2, 14.2      1  6650           13   # these will be kept
4   1001  18.2, 14.2      1  6650           14   #
..   ...         ...    ...   ...          ...
76  6895  36.2, 17.4     11  6895           11
77  6895  36.2, 17.4     11  6895           12
78  6895  36.2, 17.4     12  6895           10
79  6895  36.2, 17.4     12  6895           11
80  6895  36.2, 17.4     12  6895           12

[81 rows x 5 columns]
```
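For completeness, here is a self-contained sketch of the whole approach; the `DataFrame` construction is my reading of the posted table, with `Coordinate` kept as a plain string so identical coordinates compare equal:

```python
import numpy as np
import pandas as pd

# sample data reconstructed from the question's table
df = pd.DataFrame({
    'Index': range(1, 16),
    'Key': [1001]*3 + [2001]*3 + [5004]*3 + [6895]*3 + [6650]*3,
    'Sequence': [1, 2, 3] * 5,
    'Coordinate': ['18.2, 14.2']*3 + ['25.6, 22.8']*6
                  + ['36.2, 17.4']*3 + ['18.2, 14.2']*3,
})

cols = ['Key', 'Coordinate', 'Index']

# self-merge on Coordinate, drop same-Key pairs, join matching Index values
df['Matched_with'] = df['Index'].map(
    df[cols]
    .merge(df[cols]
           .astype({'Index': 'str'})
           .rename(columns={'Index': 'Matched_with'}),
           on='Coordinate')
    .loc[lambda d: d['Key_x'].ne(d['Key_y'])]
    .groupby('Index')['Matched_with'].agg(','.join)
)
df['Duplicate'] = np.where(df['Matched_with'].notna(), 'Yes', 'No')

print(df)
```

Note that rows whose coordinate only occurs under their own key (here 6895) never survive the `Key_x != Key_y` filter, so they are absent from the mapped Series and end up as `NaN` / `No`.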