I have an Oracle database application which creates lists of duplicate records for people to review, with users being able to deduplcate, leave or mark the record as being different to each other (legitimate duplicates). I need to report on the number of records in this queue. The challenge is the legitimate duplicates as these are held on a second table of data and are held as two rows for each pair marked up (a≠b). However as duplicate groups can contain a number of duplicates, and records could be marked up a being a legitimate duplicate more than once I need to somehow remove the records that are marked up as legitimate duplicates from all the other records in a candidate group. Hopefully that is making sense.
So a simplified view of the two tables would be: Duplicate Candidates
Group Key
Dup-1 123
Dup-1 234
Dup-2 123
Dup-2 345
Dup-2 567
Dup-3 234
Dup-3 567
Dup-4 123
Dup-4 567
Dup-4 235
Legitimate Duplicates:
Group Key
A 123
A 234
B 345
B 456
C 123
C 567
D 123
D 235
The results I would like to return from this example would be:
Duplicate Candidates
Group Key
Dup-2 123
Dup-2 345
Dup-2 567
Dup-3 234
Dup-3 567
Dup-4 567
Dup-4 235
Dup-1 would not be returned as Legitimate Group A has both Keys, Dup-2 would be returned as while both Key 123 and 345 are marked up as legitimate duplicates they are not currently marked as different to each other. Dup-3 again should be returned as the two records are not marked as legitimate duplicates. Finally the row Dup-4 123 should not be returned as it is marked up as legitimate duplicate to both of the other records in the group, but they should be returned as they are not marked a legitimate duplicates of each other.
I really need to carry this out in SQL as I will feed this data into a reporting solution (Business Objects or Tablaux) directly. Is anyone able to give me a nudge in the right direction on this. Unfortunately our software is completely black box so I cannot reverse engineer this from the code that deals with this for users.
CodePudding user response:
Using Exists. Return the Candidate if at least one other member of the group is not in any group of Legitimates where the tested Candidate is a member too.
select *
from Candidates c
where exists (
select 1
from Candidates c2
where c2.Grp = c.Grp and c2.Key <> c.Key
and not exists (
select 1
from Legitimate l
where l.Key = c.Key
and exists (
select 1
from Legitimate l2
where l2.Grp = l.Grp and l2.Key = c2.Key
)
)
)
order by Grp, Key
CodePudding user response:
Generate pairs of duplicate candidates using:
SELECT "GROUP",
PRIOR key AS key1,
key AS key2
FROM duplicate_candidates
WHERE LEVEL = 2
CONNECT BY PRIOR "GROUP" = "GROUP"
AND PRIOR key < key
and pairs of legitimate candidates using:
SELECT MIN(key), MAX(key)
FROM legitimate_duplicates
GROUP BY "GROUP"
Then you can find the duplicate pairs that are not in the legitimate candidate pairs and UNPIVOT
the pairs and find the DISTINCT
keys. In a single query:
SELECT DISTINCT
"GROUP",
key
FROM (
SELECT "GROUP",
PRIOR key AS key1,
key AS key2
FROM duplicate_candidates
WHERE LEVEL = 2
AND (PRIOR key, key) NOT IN (SELECT MIN(key), MAX(key)
FROM legitimate_duplicates
GROUP BY "GROUP")
CONNECT BY PRIOR "GROUP" = "GROUP"
AND PRIOR key < key
)
UNPIVOT (key FOR id IN (key1, key2))
Which, for the sample data:
CREATE TABLE duplicate_candidates ("GROUP", Key) AS
SELECT 'Dup-1', 123 FROM DUAL UNION ALL
SELECT 'Dup-1', 234 FROM DUAL UNION ALL
SELECT 'Dup-2', 123 FROM DUAL UNION ALL
SELECT 'Dup-2', 345 FROM DUAL UNION ALL
SELECT 'Dup-2', 567 FROM DUAL UNION ALL
SELECT 'Dup-3', 234 FROM DUAL UNION ALL
SELECT 'Dup-3', 567 FROM DUAL UNION ALL
SELECT 'Dup-4', 123 FROM DUAL UNION ALL
SELECT 'Dup-4', 567 FROM DUAL UNION ALL
SELECT 'Dup-4', 235 FROM DUAL;
CREATE TABLE Legitimate_Duplicates ("GROUP", Key) AS
SELECT 'A', 123 FROM DUAL UNION ALL
SELECT 'A', 234 FROM DUAL UNION ALL
SELECT 'B', 345 FROM DUAL UNION ALL
SELECT 'B', 456 FROM DUAL UNION ALL
SELECT 'C', 123 FROM DUAL UNION ALL
SELECT 'C', 567 FROM DUAL UNION ALL
SELECT 'D', 123 FROM DUAL UNION ALL
SELECT 'D', 235 FROM DUAL;
Note: GROUP
is a reserved word and cannot be an unquoted identifier. It would be better to use a different name for the columns but you can use a quoted identifier (but its not best practice).
Outputs:
GROUP KEY Dup-2 123 Dup-2 345 Dup-2 567 Dup-3 234 Dup-3 567 Dup-4 235 Dup-4 567
db<>fiddle here