I work with a lot of linked data from multiple tables. As a result, I'm running into some challenges with deduplication and re-coding values into new columns in a more meaningful way.
My core data set is a list of person-level records as rows. However, the linked data include multiple rows per person based on the dates they've been booked into events, whether they've showed up or not, and whether they're a member of our organisation. There are usually multiple bookings. It is possible to lose membership status and continue to attend events/cancel/etc, but we are interested in whether or not they have ever been a member and if not, which is the highest level of contact they have ever had with our organisation.
In short: If they have ever been a member, that needs to take precedence.
select distinct
a.ticketnumber
a.id
-- (many additional columns from multiple tables here)
case
when b.Went_Member >=1 then 'Member'
when b.Went_NonMember >=1 then 'Attended but not member'
when b.Going_NonMember >=1 then 'Going but not member'
when b.OptOut='1' then 'Opt Out'
when b.Cancelled >=1 then 'Cancelled'
when c.MemberStatus = '9' then 'Member'
when c.MemberStatus = '6' then 'Attended but not member'
when c.DateBooked > current_timestamp then 'Going but not member'
when c.OptOut='1' then 'Opt out'
when c.MemberStatus = '8' then 'Cancelled'
end [NewMemberStatus]
from table1 a
left join TableWithMemberStatus1 b on a.id = b.id
left join TableWithMemberStatus2 c on a.id = c.id
-- (further left joins to additional tables here)
order by a.ticketnumber
Table b is more accurate because these are our internal records, whereas table c is from a third party. Annoyingly, the numbers in C aren't in the same meaningful order as we've decided so I can't have it select the highest value for each ID.
I was under the impression that CASE goes down the list of WHEN statements and returns the first matching value, but this will produce multiple rows. For example:
ID | NewMemberStatus |
---|---|
989898 | NULL |
989898 | Cancelled |
777777 | Member |
111111 | Cancelled |
111111 | Member |
I feel like maybe there is something missing in terms of ORDER BY or GROUP BY that I should be adding? I tried COALESCE with CASE inside and it didn't work. Should I be nesting some things in parentheses?
CodePudding user response:
In your query you are showing all rows (all bookings), because there is no WHERE
clause and no aggregation. But you only want one result row per person.
You want a person's best status from the internal table. If there is no entry for the person in the internal table, you want their best status from the third party table. You get the best statuses by aggregating the rows in the internal and third party tables by person. Then join to the person.
I am using status numbers, because these can be ordered (I use 1 for the best status (member), so I look for the minimum status). In the end I replace the number found with the related text (e.g. 'Member' for status 1).
select
p.*,
case coalesce(i.best_status, tp.best_status)
when 1 then 'Member'
when 2 then 'Attended but not member'
when 3 then 'Going but not member'
when 4 then 'Opt out'
when 5 then 'Cancelled'
else 'unknown'
end as status
from person p
left join
(
select
person_id,
min(case when went_member >= 1 then 1
when went_nonmember >= 1 then 2
when going_nonmember >= 1 then 3
when optout = 1 then 4
when cancelled >= 1 then 5
end) as best_status
from internal_table
group by person_id
) i on i.person_id = p.person_id
left join
(
select
person_id,
min(case when MemberStatus = 9 then 1
when MemberStatus = 6 then 2
when DateBooked > current_timestamp then 3
when optout = 1 then 4
when memberstatus = 8 then 5
end) as best_status
from thirdparty_table
group by person_id
) tp on tp.person_id = p.person_id
order by p.person_id;