So I have a toughie here I've been wracking my brain on for a while.
Let's say I have a table as follows:
ID Group Timestamp Data
001 A 2021-04-13 12:51:12.063 content121
001 A-Direct 2021-04-13 12:52:13.063 content121
002 A-Direct 2021-04-13 12:50:14.063 content133
003 B-Direct 2021-04-13 12:55:12.063 content132
003 B 2021-04-13 12:56:11.063 content142
003 BA 2021-04-13 12:57:22.063 content153
004 D 2021-04-13 12:10:23.063 content113
004 C 2021-04-13 12:11:43.063 content144
005 C 2021-04-13 12:12:12.063 content111
005 A 2021-04-13 12:13:23.063 content100
005 D-Direct 2021-04-13 12:15:23.063 content121
006 A 2021-04-13 12:51:12.063 content121
006 B-Direct 2021-04-13 12:52:13.063 content121
007 A-Direct 2021-04-13 12:51:12.063 content121
007 A 2021-04-13 12:52:13.063 content121
008 B-Direct 2021-04-13 12:55:12.063 content132
008 B 2021-04-13 12:56:11.063 content142
008 B-Direct 2021-04-13 12:57:22.063 content153
009 B-Direct 2021-04-13 12:55:12.063 content132
009 C-Direct 2021-04-13 12:56:11.063 content142
009 D-Direct 2021-04-13 12:57:22.063 content153
So I need a table which contains one distinct ID as each row. But the selection criteria on which ID makes it is a little complicated.
The default selection should be the most recent entry, selected via TIMESTAMP
.
But the complexity comes from any ID's that have rows with -Direct
. Specifically, if a row has more than one entry and one is (for example) A
and the other is A-Direct
, we need the A
. This is only the case when the letter matches. As seen in the case for ID = 006
, we want B-Direct
since its counterpart is A
.
So at it's core the logic I am looking for is
If an ID has rows beginning with the same string, and one of them ends in -Direct
, substitute it with the -Direct
removed.
Final Output:
ID Group
001 A
002 A-Direct
003 BA
004 C
005 D-Direct
006 B-Direct
007 A
008 B
009 D-Direct
For added clarity, here is an outline of what happened to each ID:
- ID 001:
A
is followed byA-Direct
so we subA-Direct
forA
- ID 002:
A-Direct
is the only result, easy! - ID 003:
BA
,B
,B-Direct
are distinct, therefore we stick with the most recent,BA
. - ID 004: No direct, so we just take the most recent,
C
- ID 005:
D-Direct
is the most recent, but because there is noD
, we stick withD-Direct
- ID 006:
B-Direct
is the most recent, but because there is noB
, we stick withB-Direct
- ID 007:
A-Direct
is followed byA
so we simply take the most recent one, no problem. - ID 008:
B
andB-Direct
(x2) appear here, therefore we can useB
. - ID 009: All options are Direct, so we go with the most recent,
D-Direct
I can figure out how to get the most recent, but with the above criteria, I am unsure how to adjust
WITH data AS (
select d.*,
rank() over (
partition by ID
order by TIMESTAMP DESC
) as num
FROM table d
)
select ID, TIMESTAMP
from data
where num = 1
CodePudding user response:
I might start with something like the following. It isn't super-pretty so there might be a better solution, but I think it does what you want.
WITH data AS (
select d.*,
rank() over (
partition by ID
order by TIMESTAMP DESC
) as num
FROM table d
)
select ID,
CASE
WHEN EXISTS (SELECT * FROM table t WHERE t.id = d.id AND t.group || '-Direct' = d.group)
THEN replace(d.group, '-Direct')
ELSE d.group
END group
from data d
where num = 1
This gets the most recent one for each id (using your current code), but the case
/exists
statement in the select
clause checks whether there is a match without "-Direct" and, if so, we remove the "-Direct" from the string.
CodePudding user response:
Using:
SELECT ID
,CASE WHEN MIN(group) OVER(PARTITION BY ID, REPLACE(group, '-Direct'))
= MAX(group) OVER(PARTITION BY ID, REPLACE(group, '-Direct'))
THEN group
ELSE REPLACE(group, '-Direct')
END AS grp
FROM tab
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY TIMESTAMP DESC) = 1;
Qualify ensures to take the lastest value per timestamp, and case expression handles '-Direct' override.