Home > Mobile >  How to find duplicates records across multiple columns
How to find duplicates records across multiple columns

Time:07-21

I have a contacts table that has multiple emails and phone columns. My goal is to find duplicate records by comparing these emails and phone numbers and marking duplicates in groups.

id email1 email2 email 3 phone1 phone 2
1 email1 email2 email3 123 321
2 email2 email4 email5 234 432
3 email6 email7 email8 234
4 email9 email10 email11
5 email12 email9

In the example above, records 1 and 2 are duplicates because they are holding the same email email2, and records 2 and 3 are duplicates because they have the same phone 234.
Because record 2 duplicates with both 1 and 3, records 1, 2, 3 should be marked in one duplication group.

The expected output should be like this:

dup_grpup record_id
A 1
A 2
A 3
B 4
B 5

I've been trying for 2 days and still haven't come up with a solution. I can do this with programing language, but I'm interested in implementing it in SQL.

Any suggestions and ideas are appreciated.

CodePudding user response:

In below query -

  • First CTE with data - is only data.
  • Second CTE cte_1 combines all email and phone columns into an array.
  • Third CTE cte_2 is the core of the entire query. It uses function arrays_overlap to compare current record with all other records in the table using self-join. In case of match it picks the least of two IDs that are matching. Column child in cte_2 creates the pivot or starting point for different hierarchical groups in the data-set.
  • Fourth CTE final_cte picks the minimum and creates the final hierarchical data-set which links each id (parent) via column chk (child) in final_cte.
  • Lastly, hierarchial query is used to get the top root for each element.

I had to use coalesce as function arrays_overlap](https://docs.snowflake.com/en/sql-reference/functions/arrays_overlap.html#arrays-overlap) is NULL-safe, so it returns true when comparing NULLs, so by using coalesce, I ensured that null records in across rows are unique.

with data (id,email1,email2,email3,phone1,phone2) as (
select * from values
(1,'email1','email2','email3',123,321),
(2,'email2','email4','email5',234,432),
(3,'email6','email7','email8',234,null),
(4,'email9','email10','email11',null,null),
(5,'email12','email9',null,null,null)
), cte_1 (id,arr) as
(select id,
array_construct(
coalesce(email1, id::string),
coalesce(email2, id::string),
coalesce(email3, id::string),
coalesce(phone1, id::number),
coalesce(phone2, id::number))
from data), cte_2 as (
select c1.id id1,c2.id id2,
case when arrays_overlap(c1.arr,c2.arr) then least(c1.id,c2.id) else null end chk,
case when (id1 = chk) then null else chk end child
from cte_1 c1 left join cte_1 c2
where c1.id != c2.id ), final_cte as (
select id1, 
min(child) chk
from cte_2
group by id1
order by id1
)
select id1, connect_by_root id1 as parent from final_cte 
start with chk is null 
connect by chk = prior id1 
order by id1;

ID1 PARENT
1 1
2 1
3 1
4 4
5 4

If you indeed need dup_group as A,B,C then add another query at the end; something like -

select id1, 
chr(65 conditional_change_event(parent) 
over (order by id1)) dup_group 
from cte_name;

CodePudding user response:

You could use some window functions and an unpivot to accomplish something like this. I'm using rank() in my example, which will give you a numeric dupe group. I am only doing emails in this example, but you could simply do a union with the same logic for phones:

with x as (
    SELECT id, email1, email2, email3
    FROM (VALUES (1,'email1','email2','email3'),
                 (2,'email2','email4','email5'),
                 (3,'email6','email7','email8'),
                 (4,'email9','email11','email10') y (id,email1,email2,email3)
         )
)
SELECT id, 
--     emails, 
       RANK() OVER (ORDER BY emails) as dupe_group
FROM x
UNPIVOT(emails for email in (email1,email2,email3))
QUALIFY COUNT(DISTINCT id) OVER (PARTITION BY emails) > 1
;

Please note that the cte at the beginning is just creating a dataset to test with. And I commented out emails in the output, since you don't have it in yours, but I used it to make sure my code was working.

  • Related