I'm trying to get the bulk of my data cleaning taken care of using SQL queries. Sometimes we will just need a person to make the call, but in cases where, say, the only differences in data are ones where data is missing, I'd like to be able to just return a single row. Say we have the following data:
| FirstName | LastName | Email | Phone |
|:--------- |:-------- |:------------------ |:--------- |
| Joe | Bloggs | joe.bl@domain.com | 9412 3456 |
| Bob | Jones | bob.jones@abc.com | 9498 7654 |
| Bob | Jones | bob.jones@abc.com | |
| Mary | Flubbs | | |
| Mary | Flubbs | | 9411 1111 |
| Mary | Flubbs | marymary@corp.com | |
| Taylor | Minns | taylor@minnsco.com | 9499 8877 |
| Taylor | Minns | | 9477 8899 |
| Taylor | Minns | | |
Joe only has one record. Bob's records are identical except for a missing phone number. Mary's also have no discrepancies, just empties. Taylor has two different phone numbers, and will therefore still need two records, like so:
| FirstName | LastName | Email | Phone |
|:--------- |:-------- |:------------------ |:--------- |
| Joe | Bloggs | joe.bl@domain.com | 9412 3456 |
| Bob | Jones | bob.jones@abc.com | 9498 7654 |
| Mary | Flubbs | marymary@corp.com | 9411 1111 |
| Taylor | Minns | taylor@minnsco.com | 9499 8877 |
| Taylor | Minns | taylor@minnsco.com | 9477 8899 |
If I simply go SELECT DISTINCT...
it will show empties. If I use MAX()
or MIN()
I'll lose data. Is there a way to collapse records down like this without losing distinct data?
CodePudding user response:
You can either use Cross Apply
or Inner Join
. Assuming the same names are the same people
Cross Apply
declare @mytable table (Firstname varchar(50), LastName varchar(50), Email varchar(50), Phone varchar(50))
insert into @mytable
values
('Joe' , 'Bloggs' , '[email protected]' , '9412 3456') ,
('Bob' , 'Jones' , '[email protected]' , '9498 7654') ,
('Bob' , 'Jones' , '[email protected]' , '') ,
('Mary' , 'Flubbs' , '' , '') ,
('Mary' , 'Flubbs' , '' , '9411 1111') ,
('Mary' , 'Flubbs' , '[email protected]' , '') ,
('Taylor' , 'Minns' , '[email protected]' , '9499 8877') ,
('Taylor' , 'Minns' , '' , '9477 8899') ,
('Taylor' , 'Minns' , '' , '')
select distinct firstname, lastname, x.Email, y.Phone
from @mytable t
cross apply ( -- this statement will only filter emails
select email
from @mytable e
where nullif(email,'') is not null
and e.Firstname = t.Firstname
and e.LastName = t.LastName
) x
cross apply ( -- this statement will only filter phone
select phone
from @mytable p
where nullif(phone,'') is not null
and p.Firstname = t.Firstname
and p.LastName = t.LastName
) y
Inner Join
select distinct t.firstname, t.lastname, x.Email, y.Phone
from @mytable t
inner join (
select Firstname, LastName, email
from @mytable e
where nullif(email,'') is not null
) x on x.Firstname = t.Firstname
and x.LastName = t.LastName
inner join (
select Firstname, LastName, phone
from @mytable p
where nullif(phone,'') is not null
) y on y.Firstname = t.Firstname
and y.LastName = t.LastName
CodePudding user response:
You can use CASE
expression on Email
and Phone
to select between the current value (if there is one) or find a value for the current row if it's empty or null.
SELECT *
FROM (
SELECT
FirstName
, LastName
, CASE
WHEN ISNULL(p1.Email, '') = ''
THEN (
SELECT MAX(p2.Email)
FROM Person p2
WHERE
p1.FirstName = p2.FirstName
AND p1.LastName = p2.LastName
AND ISNULL(p2.Email, '') <> ''
)
ELSE p1.Email
END Email
, CASE
WHEN ISNULL(p1.Phone, '') = ''
THEN (
SELECT MAX(p2.Phone)
FROM Person p2
WHERE
p1.FirstName = p2.FirstName
AND p1.LastName = p2.LastName
AND ISNULL(p2.Phone, '') <> ''
)
ELSE p1.Phone
END Phone
FROM
Person p1
) e
GROUP BY
FirstName
, LastName
, Email
, Phone