I have a contacts table with multiple email and phone columns. My goal is to find duplicate records by comparing these emails and phone numbers and marking the duplicates in groups.
id | email1 | email2 | email3 | phone1 | phone2 |
---|---|---|---|---|---|
1 | email1 | email2 | email3 | 123 | 321 |
2 | email2 | email4 | email5 | 234 | 432 |
3 | email6 | email7 | email8 | 234 | |
4 | email9 | email10 | email11 | | |
5 | email12 | email9 | | | |
In the example above, records 1 and 2 are duplicates because they hold the same email `email2`, and records 2 and 3 are duplicates because they share the same phone `234`. Because record 2 duplicates both 1 and 3, records 1, 2, and 3 should be marked as one duplication group.
The expected output should be like this:
dup_group | record_id |
---|---|
A | 1 |
A | 2 |
A | 3 |
B | 4 |
B | 5 |
I've been trying for two days and still haven't come up with a solution. I can do this in a programming language, but I'm interested in implementing it in SQL.
Any suggestions and ideas are appreciated.
CodePudding user response:
In the query below:

- The first CTE, `data`, is only the sample data.
- The second CTE, `cte_1`, combines all email and phone columns into an array.
- The third CTE, `cte_2`, is the core of the entire query. It uses the function `arrays_overlap` to compare the current record with every other record in the table via a self-join. In case of a match it picks the lesser of the two matching IDs. The column `child` in `cte_2` creates the pivot, or starting point, for the different hierarchical groups in the data set.
- The fourth CTE, `final_cte`, picks the minimum and creates the final hierarchical data set, which links each id (parent) via the column `chk` (child).
- Lastly, a hierarchical query is used to get the top root for each element.

I had to use `coalesce` because the function [arrays_overlap](https://docs.snowflake.com/en/sql-reference/functions/arrays_overlap.html#arrays-overlap) is NULL-safe, so it returns TRUE when comparing NULLs; by using `coalesce`, I ensured that NULL entries across rows stay unique.
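For instance, here is a minimal standalone check of that NULL-safe behavior (the letter values are arbitrary placeholders):

```sql
-- Two arrays that share only a NULL still count as overlapping,
-- which is why each NULL gets replaced with a row-unique value.
select arrays_overlap(array_construct('a', null),
                      array_construct('b', null));  -- returns TRUE
```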
with data (id, email1, email2, email3, phone1, phone2) as (
    select * from values
        (1, 'email1',  'email2',  'email3',  123,  321),
        (2, 'email2',  'email4',  'email5',  234,  432),
        (3, 'email6',  'email7',  'email8',  234,  null),
        (4, 'email9',  'email10', 'email11', null, null),
        (5, 'email12', 'email9',  null,      null, null)
), cte_1 (id, arr) as (
    -- collapse all emails and phones into one array per record;
    -- NULLs become the row's own id so they never match across rows
    select id,
           array_construct(
               coalesce(email1, id::string),
               coalesce(email2, id::string),
               coalesce(email3, id::string),
               coalesce(phone1, id::number),
               coalesce(phone2, id::number))
    from data
), cte_2 as (
    -- self-join every record against every other; on an overlap keep the
    -- lesser of the two ids (chk); child stays NULL for a group's root record
    select c1.id id1, c2.id id2,
           case when arrays_overlap(c1.arr, c2.arr) then least(c1.id, c2.id) else null end chk,
           case when (id1 = chk) then null else chk end child
    from cte_1 c1
    join cte_1 c2 on c1.id != c2.id
), final_cte as (
    -- one row per id, linked to the smallest overlapping id
    select id1, min(child) chk
    from cte_2
    group by id1
)
-- hierarchical query: walk each chain up to its root id
select id1, connect_by_root id1 as parent
from final_cte
start with chk is null
connect by chk = prior id1
order by id1;
ID1 | PARENT |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 4 |
5 | 4 |
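In case the hierarchical part is unfamiliar, here is the same root-finding idea on a toy parent/child list (standalone; the values just mirror the shape of `final_cte`):

```sql
-- Each row points at its group's smaller id via chk; roots have chk = NULL.
-- CONNECT_BY_ROOT surfaces the root id for every row in the chain.
with t (id1, chk) as (
    select * from values (1, null), (2, 1), (3, 2), (4, null), (5, 4)
)
select id1, connect_by_root id1 as parent
from t
start with chk is null
connect by chk = prior id1
order by id1;  -- parents: 1, 1, 1, 4, 4
```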
If you indeed need dup_group as A, B, C, then add another query at the end; something like:

select id1,
       chr(65 + conditional_change_event(parent)
                over (order by id1)) dup_group
from cte_name;
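For reference, a minimal standalone sketch of what that mapping does (the inline values mimic the ID1/PARENT output above; `conditional_change_event` increments each time `parent` changes):

```sql
select id1,
       chr(65 + conditional_change_event(parent) over (order by id1)) as dup_group
from (values (1, 1), (2, 1), (3, 1), (4, 4), (5, 4)) t (id1, parent);
-- dup_group: A, A, A, B, B
```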
CodePudding user response:
You could use some window functions and an `UNPIVOT` to accomplish something like this. I'm using `rank()` in my example, which will give you a numeric dupe group. I am only doing emails in this example, but you could simply do a union with the same logic for phones (sketched after the code below):
with x as (
    -- sample dataset to test with
    SELECT id, email1, email2, email3
    FROM (VALUES (1,'email1','email2','email3'),
                 (2,'email2','email4','email5'),
                 (3,'email6','email7','email8'),
                 (4,'email9','email11','email10')) y (id,email1,email2,email3)
)
SELECT id,
       -- emails,
       RANK() OVER (ORDER BY emails) as dupe_group
FROM x
UNPIVOT(emails for email in (email1,email2,email3))
QUALIFY COUNT(DISTINCT id) OVER (PARTITION BY emails) > 1
;
Please note that the CTE at the beginning is just creating a dataset to test with. I commented out `emails` in the output, since you don't have it in yours, but I used it to make sure my code was working.
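For completeness, here is one rough sketch of the union extension mentioned above; this wiring is my assumption, reusing the question's column names and casting phones to strings so the two branches union cleanly:

```sql
with contacts (id, email1, email2, email3, phone1, phone2) as (
    select * from values
        (1, 'email1',  'email2',  'email3',  123,  321),
        (2, 'email2',  'email4',  'email5',  234,  432),
        (3, 'email6',  'email7',  'email8',  234,  null),
        (4, 'email9',  'email10', 'email11', null, null),
        (5, 'email12', 'email9',  null,      null, null)
), vals as (
    -- one (id, val) row per email or phone;
    -- phones become strings so the two branches share a type
    select id, emails as val
    from contacts unpivot (emails for email in (email1, email2, email3))
    union all
    select id, phones::string as val
    from contacts unpivot (phones for phone in (phone1, phone2))
)
select id, val,
       rank() over (order by val) as dupe_group
from vals
where val is not null  -- defensive, in case NULLs survive the unpivot
qualify count(distinct id) over (partition by val) > 1
order by dupe_group, id;
```

Note that, like the email-only version, this flags pairwise matches per shared value; it does not merge the two groups that share record 2 into one, which is what the hierarchical answer above handles.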