Selecting a row in SQL (Snowflake) based on String-CodePudding

So I have a toughie here I've been wracking my brain on for a while.

Let's say I have a table as follows:

ID      Group                 Timestamp   Data
001         A   2021-04-13 12:51:12.063   content121
001  A-Direct   2021-04-13 12:52:13.063   content121
002  A-Direct   2021-04-13 12:50:14.063   content133
003  B-Direct   2021-04-13 12:55:12.063   content132
003         B   2021-04-13 12:56:11.063   content142
003        BA   2021-04-13 12:57:22.063   content153
004         D   2021-04-13 12:10:23.063   content113
004         C   2021-04-13 12:11:43.063   content144
005         C   2021-04-13 12:12:12.063   content111
005         A   2021-04-13 12:13:23.063   content100
005  D-Direct   2021-04-13 12:15:23.063   content121
006         A   2021-04-13 12:51:12.063   content121
006  B-Direct   2021-04-13 12:52:13.063   content121
007  A-Direct   2021-04-13 12:51:12.063   content121
007         A   2021-04-13 12:52:13.063   content121
008  B-Direct   2021-04-13 12:55:12.063   content132
008         B   2021-04-13 12:56:11.063   content142
008  B-Direct   2021-04-13 12:57:22.063   content153
009  B-Direct   2021-04-13 12:55:12.063   content132
009  C-Direct   2021-04-13 12:56:11.063   content142
009  D-Direct   2021-04-13 12:57:22.063   content153

So I need a table which contains one distinct ID as each row. But the selection criteria on which ID makes it is a little complicated.

The default selection should be the most recent entry, selected via TIMESTAMP.

But the complexity comes from any ID's that have rows with -Direct. Specifically, if a row has more than one entry and one is (for example) A and the other is A-Direct, we need the A. This is only the case when the letter matches. As seen in the case for ID = 006, we want B-Direct since its counterpart is A.

So at it's core the logic I am looking for is

If an ID has rows beginning with the same string, and one of them ends in -Direct, substitute it with the -Direct removed.

Final Output:

ID      Group
001         A
002  A-Direct
003        BA
004         C
005  D-Direct
006  B-Direct
007         A
008         B
009  D-Direct

For added clarity, here is an outline of what happened to each ID:

ID 001: A is followed by A-Direct so we sub A-Direct for A
ID 002: A-Direct is the only result, easy!
ID 003: BA,B, B-Direct are distinct, therefore we stick with the most recent, BA.
ID 004: No direct, so we just take the most recent, C
ID 005: D-Direct is the most recent, but because there is no D , we stick with D-Direct
ID 006: B-Direct is the most recent, but because there is no B , we stick with B-Direct
ID 007: A-Direct is followed by A so we simply take the most recent one, no problem.
ID 008: Band B-Direct (x2) appear here, therefore we can use B.
ID 009: All options are Direct, so we go with the most recent, D-Direct

I can figure out how to get the most recent, but with the above criteria, I am unsure how to adjust

WITH data AS (
    select d.*,
        rank() over (
            partition by ID
            order by TIMESTAMP DESC
        ) as num
    FROM table d
)
select ID, TIMESTAMP
    from data
    where num = 1

CodePudding user response：

I might start with something like the following. It isn't super-pretty so there might be a better solution, but I think it does what you want.

WITH data AS (
    select d.*,
        rank() over (
            partition by ID
            order by TIMESTAMP DESC
        ) as num
    FROM table d
)
select ID, 
 CASE 
  WHEN EXISTS (SELECT * FROM table t WHERE t.id = d.id AND t.group || '-Direct' = d.group) 
   THEN replace(d.group, '-Direct') 
   ELSE d.group 
 END group
    from data d
    where num = 1

This gets the most recent one for each id (using your current code), but the case/exists statement in the select clause checks whether there is a match without "-Direct" and, if so, we remove the "-Direct" from the string.

CodePudding user response：

Using:

SELECT ID
   ,CASE WHEN MIN(group) OVER(PARTITION BY ID, REPLACE(group, '-Direct'))
             = MAX(group) OVER(PARTITION BY ID, REPLACE(group, '-Direct'))
         THEN group
         ELSE REPLACE(group, '-Direct')
    END AS grp
FROM tab
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY TIMESTAMP DESC) = 1;

Qualify ensures to take the lastest value per timestamp, and case expression handles '-Direct' override.