Home > Back-end >  Choosing the most recent when joining in Snowflake (SQL)
Choosing the most recent when joining in Snowflake (SQL)

Time:11-19

So I have a list as follows:

Table 1

ID      TIMESTAMP                  GROUP
001     2021-04-01 12:51:12.063    A    
001     2021-04-04 12:51:12.063    G    
001     2021-04-14 10:47:03.022    B
002     2021-01-13 09:46:23.012    C
003     2021-09-10 03:32:53.043    D
004     2021-04-13 01:12:54.056    D
004     2021-04-13 11:12:26.054    A
004     2021-04-13 21:53:36.023    D
005     2021-04-01 13:53:13.023    F
005     2021-04-11 13:53:13.023    J
003     2022-04-13 20:32:11.011    G
006     2021-08-13 20:32:11.011    G

And I also have a list of events:

TABLE 2

EVENT    ID    TIMESTAMP              
eventA   001   2021-04-02 12:51:12.063
eventB   001   2021-04-13 12:51:12.063
eventA   002   2021-04-01 12:51:12.063 
eventA   002   2021-04-13 12:51:12.063      
eventA   002   2021-04-14 12:51:12.063
eventA   003   2021-10-17 12:51:12.063
eventB   005   2021-04-10 12:51:12.063
eventB   005   2021-04-21 12:51:12.063
eventA   006   2021-05-01 20:32:11.011

And my goal here is for every event in TABLE 2, I want to join the most recent entry from table 1 based on ID. If there are no preceding entries in Table 1, though they exist, they should be null on the join.

So in short, for every row in Table 2, we need to find the most recent group for that ID based on timestamp.

Final Result

EVENT    ID    TIMESTAMP                  group
eventA   001   2021-04-02 12:51:12.063        A
eventB   001   2021-04-13 12:51:12.063        G
eventA   002   2021-04-01 12:51:12.063     NULL
eventA   002   2021-04-13 12:51:12.063        C      
eventA   002   2021-04-14 12:51:12.063        C
eventA   003   2021-10-17 12:51:12.063        D
eventB   005   2021-04-10 12:51:12.063        F
eventB   005   2021-04-21 12:51:12.063        J
eventA   006   2021-05-01 20:32:11.011     NULL

CodePudding user response:

So if you do a LEFT JOIN based on prior (equal?) timestamps and then prune the overmatches to just the most recent with a QUALIFY this can be done with:

SELECT t2.event
    t2.id
    t2.timestamp
    t1.group
FROM table2 AS t2
LEFT JOIN table1 AS t1
    ON t2.id = t1.id AND t2.timestamp >= t1.timestamp
QUALIFY ROW_NUMBER() OVER (
        PARTITON BY t2.id, t2.timestamp 
        ORDER BY t1.timestamp DESC NULLS LAST
    ) = 1
ORDER BY 1,2,3;

this will work as long as Table2 has no duplicate ID, Timestamp values

CodePudding user response:

Window functions with QUALIFY ROW_NUMBER() work to get the latest row as Simeon shows. I've found that for this type of join (often called an AsOf join) if the tables are very large this join, find the max timestamp and rejoin approach usually completes faster than using a window function:

select J."EVENT", J.ID, J."TIMESTAMP", "GROUP" from
(select * from T2, 
    lateral (select max(T1."TIMESTAMP") TS from T1 where T1.ID = T2.ID and T1.TIMESTAMP < T2."TIMESTAMP")) J
left join T1 on J.TS = T1."TIMESTAMP"
;
  • Related