Home > database >  Checking and finding duplicate values
Checking and finding duplicate values

Time:10-19

I have a table:

Key Sequence Longitude Latitude
1001 1 18.2 14.2
1001 2 18.2 14.2
1001 3 18.2 14.2
2001 1 25.6 22.8
2001 2 25.6 22.8
2001 3 25.6 22.8
5004 1 25.6 22.8
5004 2 25.6 22.8
5004 3 25.6 22.8
6895 1 36.2 17.4
6895 2 36.2 17.4
6895 3 36.2 17.4
6650 1 18.2 14.2
6650 2 18.2 14.2
6650 3 18.2 14.2

From the table I need to find out the keys (different keys) that have duplicate longitude and latitude. If a key having any of sequence has duplicate longitude and latitude all the sequences of the same key should be shown duplicate. (comparison between same key sequences does not happen).

The output table should be:

Key Sequence Longitude Latitude Duplicate
1001 1 18.2 14.2 No
1001 2 18.2 14.2 No
1001 3 18.2 14.2 No
2001 1 25.6 22.8 No
2001 2 25.6 22.8 No
2001 3 25.6 22.8 No
5004 1 25.6 22.8 Yes
5004 2 25.6 22.8 Yes
5004 3 25.6 22.8 Yes
6895 1 36.2 17.4 No
6895 2 36.2 17.4 No
6895 3 36.2 17.4 No
6650 1 18.2 14.2 Yes
6650 2 18.2 14.2 Yes
6650 3 18.2 14.2 Yes

CodePudding user response:

First remove the duplicates per key, then check what is still duplicated between keys:

# get example of duplicated rows
s = (df
 .drop_duplicates(subset=['Key', 'Longitude', 'Latitude'])
 .duplicated(['Longitude', 'Latitude'])
 )

# extract the original keys and mark all matching rows
df['Duplicate'] = np.where(df['Key'].isin(df.loc[s[s].index, 'Key']), 'Yes', 'No')

print(df)

Output:

     Key  Sequence  Longitude  Latitude Duplicate
0   1001         1       18.2      14.2        No
1   1001         2       18.2      14.2        No
2   1001         3       18.2      14.2        No
3   2001         1       25.6      22.8        No
4   2001         2       25.6      22.8        No
5   2001         3       25.6      22.8        No
6   5004         1       25.6      22.8       Yes
7   5004         2       25.6      22.8       Yes
8   5004         3       25.6      22.8       Yes
9   6895         1       36.2      17.4        No
10  6895         2       36.2      17.4        No
11  6895         3       36.2      17.4        No
12  6650         1       18.2      14.2       Yes
13  6650         2       18.2      14.2       Yes
14  6650         3       18.2      14.2       Yes

CodePudding user response:

#first get the keys which are duplicate
key = list(df.duplicated(['Longitude', 'Latitude'])['Key'].unique())

#then put np.where to assign yes or no
df['Duplicate'] = np.where(df['Key'].isin(key), 'Yes', 'No')
  • Related