Home > Back-end >  SQL to deal with Duplicate Record Management
SQL to deal with Duplicate Record Management

Time:02-08

I have an Oracle database application which creates lists of duplicate records for people to review, with users being able to deduplcate, leave or mark the record as being different to each other (legitimate duplicates). I need to report on the number of records in this queue. The challenge is the legitimate duplicates as these are held on a second table of data and are held as two rows for each pair marked up (a≠b). However as duplicate groups can contain a number of duplicates, and records could be marked up a being a legitimate duplicate more than once I need to somehow remove the records that are marked up as legitimate duplicates from all the other records in a candidate group. Hopefully that is making sense.

So a simplified view of the two tables would be: Duplicate Candidates

Group   Key
Dup-1   123
Dup-1   234
Dup-2   123
Dup-2   345
Dup-2   567
Dup-3   234
Dup-3   567
Dup-4   123
Dup-4   567
Dup-4   235

Legitimate Duplicates:

Group   Key
A       123
A       234
B       345
B       456
C       123
C       567
D       123
D       235

The results I would like to return from this example would be:

Duplicate Candidates

Group   Key
Dup-2   123
Dup-2   345
Dup-2   567
Dup-3   234
Dup-3   567
Dup-4   567
Dup-4   235

Dup-1 would not be returned as Legitimate Group A has both Keys, Dup-2 would be returned as while both Key 123 and 345 are marked up as legitimate duplicates they are not currently marked as different to each other. Dup-3 again should be returned as the two records are not marked as legitimate duplicates. Finally the row Dup-4 123 should not be returned as it is marked up as legitimate duplicate to both of the other records in the group, but they should be returned as they are not marked a legitimate duplicates of each other.

I really need to carry this out in SQL as I will feed this data into a reporting solution (Business Objects or Tablaux) directly. Is anyone able to give me a nudge in the right direction on this. Unfortunately our software is completely black box so I cannot reverse engineer this from the code that deals with this for users.

CodePudding user response:

Using Exists. Return the Candidate if at least one other member of the group is not in any group of Legitimates where the tested Candidate is a member too.

select * 
from Candidates c 
where exists (
    select 1 
    from Candidates c2
    where c2.Grp = c.Grp and c2.Key <> c.Key 
       and not exists (
          select 1 
          from Legitimate l 
          where l.Key = c.Key 
            and exists (
                select 1
                from Legitimate l2
                where l2.Grp = l.Grp and l2.Key = c2.Key
            )
      )
   )
order by Grp, Key

CodePudding user response:

Generate pairs of duplicate candidates using:

SELECT "GROUP",
       PRIOR key AS key1,
       key       AS key2
FROM   duplicate_candidates
WHERE  LEVEL = 2
CONNECT BY PRIOR "GROUP" = "GROUP"
AND    PRIOR key < key

and pairs of legitimate candidates using:

SELECT MIN(key), MAX(key)
FROM   legitimate_duplicates
GROUP BY "GROUP"

Then you can find the duplicate pairs that are not in the legitimate candidate pairs and UNPIVOT the pairs and find the DISTINCT keys. In a single query:

SELECT DISTINCT
       "GROUP",
       key
FROM   (
  SELECT "GROUP",
         PRIOR key AS key1,
         key       AS key2
  FROM   duplicate_candidates
  WHERE  LEVEL = 2
  AND    (PRIOR key, key) NOT IN (SELECT MIN(key), MAX(key)
                                  FROM   legitimate_duplicates
                                  GROUP BY "GROUP")
  CONNECT BY PRIOR "GROUP" = "GROUP"
  AND    PRIOR key < key
)
UNPIVOT (key FOR id IN (key1, key2))

Which, for the sample data:

CREATE TABLE duplicate_candidates ("GROUP", Key) AS
SELECT 'Dup-1', 123 FROM DUAL UNION ALL
SELECT 'Dup-1', 234 FROM DUAL UNION ALL
SELECT 'Dup-2', 123 FROM DUAL UNION ALL
SELECT 'Dup-2', 345 FROM DUAL UNION ALL
SELECT 'Dup-2', 567 FROM DUAL UNION ALL
SELECT 'Dup-3', 234 FROM DUAL UNION ALL
SELECT 'Dup-3', 567 FROM DUAL UNION ALL
SELECT 'Dup-4', 123 FROM DUAL UNION ALL
SELECT 'Dup-4', 567 FROM DUAL UNION ALL
SELECT 'Dup-4', 235 FROM DUAL;

CREATE TABLE Legitimate_Duplicates ("GROUP", Key) AS
SELECT 'A', 123 FROM DUAL UNION ALL
SELECT 'A', 234 FROM DUAL UNION ALL
SELECT 'B', 345 FROM DUAL UNION ALL
SELECT 'B', 456 FROM DUAL UNION ALL
SELECT 'C', 123 FROM DUAL UNION ALL
SELECT 'C', 567 FROM DUAL UNION ALL
SELECT 'D', 123 FROM DUAL UNION ALL
SELECT 'D', 235 FROM DUAL;

Note: GROUP is a reserved word and cannot be an unquoted identifier. It would be better to use a different name for the columns but you can use a quoted identifier (but its not best practice).

Outputs:

GROUP KEY
Dup-2 123
Dup-2 345
Dup-2 567
Dup-3 234
Dup-3 567
Dup-4 235
Dup-4 567

db<>fiddle here

  •  Tags:  
  • Related