I have a customer table with a unique customers id, a customer phone number and a customer email. It could happen, that a unique person can have several customer id's, but the same email and/or the same phone number. Just like this:
| customer id | customer email | customer phone number | person id |
|:-------------|:----------------|:-----------------------|:-----------|
| 1 |[email protected] | 0111 111111 | 1 |
| 2 |[email protected] | 0999 999999 | 1 |
| 3 |[email protected] | 0111 111111 | 1 |
| 4 |[email protected] | 0555 555555 | 1 |
| 5 |[email protected] | 0555 555555 | 1 |
| 6 |[email protected] | 0333 333333 | 6 |
| 7 |[email protected] | 0333 333333 | 6 |
I would like to check with a SQL query in Snowflake, are there any mathes between email and phone number and create a new unique person id (e.g. customer ids 1 to 5 seem to be the same unique person because of the email and phone number and the customer id 6 and 7 are one unique person, because the phone numbers are the same). The smallest customer id of this unique person should always be taken as the person id.
How can I identify this way unique persons with a SQL query?
Thank you so much for your help & best Sophie
CodePudding user response:
This is a bit ugly, and may be improved with a window function or recursive CTE, but this is quick and dirty and generates the results you posted. It requires two passes over the data because your result set shows a transitive relationship on the data.
-- Test table
insert into
customer (id, email, phone, person_id)
values
(1, '[email protected]', '0111 111111', null),(2, '[email protected]', '0999 999999', null),
(3, '[email protected]', '0111 111111', null),(4, '[email protected]', '0555 555555', null),
(5, '[email protected]', '0555 555555', null),(6, '[email protected]', '0333 333333', null),(7, '[email protected]', '0333 333333', null);
-- CTE to reflect anchor query and 2 passes to produce the results
with cte as (select
A.id,
A.email,
A.phone,
min(B.id) as min_id
from
customer A,
customer B
where
(A.email = B.email)
or (A.phone = B.phone)
group by
A.id,
A.email,
A.phone),
t1 as (select email, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from cte
group by email),
t2 as (select phone, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from cte
group by phone),
t3 as
(select cte.id, cte.email, cte.phone, case when min(t1.min_id) < min(t2.min_id) then min(t1.min_id) else min(t2.min_id) end min_id
from cte, t1, t2
where (cte.email = t1.email) and (cte.phone = t2.phone)
group by 1,2,3),
t4 as (select email, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from t3
group by email),
t5 as (select phone, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from t3
group by phone)
select t3.id, t3.email, t3.phone, case when min(t4.min_id < t3.min_id) then min(t4.min_id) when min(t5.min_id < t3.min_id) then min(t5.min_id) else min(t3.min_id) end person_id
from t3, t4, t5
where (t4.email = t3.email) and (t5.phone = t3.phone)
group by 1,2,3
order by t3.id;
-- Results
ID EMAIL PHONE PERSON_ID
1 [email protected] 0111 111111 1
2 [email protected] 0999 999999 1
3 [email protected] 0111 111111 1
4 [email protected] 0555 555555 1
5 [email protected] 0555 555555 1
6 [email protected] 0333 333333 6
7 [email protected] 0333 333333 6