Home > Enterprise >  Snowflake As of Join query
Snowflake As of Join query

Time:02-18

How to do as of query in snowflake. i.e similar to a left-join except that we match on nearest key rather than equal keys.

TABLE A:

COL_A COL_B COL_C
30 One Three
20 One Two
10 One One
05 One Five

TABLE B:

COL_A COL_B COL_C
20 Two Three
10 Two One
00 Two Zero

Join TABLE A(Left) and TABLE B(Right) on Col_A and if there is a match then get the matched row, if there is no match then get the nearest row:

COL_A(Left) COL_B(Left) COL_C(Left) COL_A (Right) COL_B(Right) COL_C(Right)
30 One Three 20 Two Three
20 One Two 20 Two Three
10 One One 10 Two Three
05 One Five 00 Two Zero

CodePudding user response:

so with some CTE's to provide the data:

WITH table_a(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (30, 'One','Three'),
        (20, 'One','Two'),
        (10, 'One','One'),
        (05, 'One','Five')
), table_b(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (20, 'Two','Three'),
        (10, 'Two','One'),
        (00, 'Two','Zero')
)

then a cross join with a QUALIFY to throw away the not nearests:

WITH table_a(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (30, 'One','Three'),
        (20, 'One','Two'),
        (10, 'One','One'),
        (05, 'One','Five')
), table_b(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (20, 'Two','Three'),
        (10, 'Two','One'),
        (00, 'Two','Zero')
)
SELECT 
    a.*, 
    b.*
FROM table_a as a
CROSS JOIN table_b as b
QUALIFY row_number() over(partition by a.col_a order by abs(a.col_a - b.col_a)) = 1
ORDER BY 1 desc;

gives unstables results, because 10 and 0 are the same distance from 5...

COL_A COL_B COL_C COL_A COL_B COL_C
30 One Three 20 Two Three
20 One Two 20 Two Three
10 One One 10 Two One
5 One Five 10 Two One

Lets see why:

So given the abs(a-b) is equal, if we add a bias that is a tenth of the smallest value that will be in col_a col_b we can move the preference in one direction

SELECT a.col_a, b.col_a
    ,a.col_a - b.col_a as sub_1
    ,b.col_a - a.col_a as sub_2
    ,abs(sub_1) as abs_1
    ,abs(sub_2) as abs_2
    ,abs(sub_1 0.01) as abs_1b
    ,abs(sub_2 0.01) as abs_2b
    ,row_number() over(partition by a.col_a order by abs_1) as rn_1_simple
    ,row_number() over(partition by a.col_a order by abs_2) as rn_2_simple
    ,row_number() over(partition by a.col_a order by abs_1b) as rn_1b
    ,row_number() over(partition by a.col_a order by abs_2b) as rn_2b
FROM table_a as a
CROSS JOIN table_b as b
ORDER BY a.col_a desc, b.col_a desc

gives a lovely table:

COL_A COL_A SUB_1 SUB_2 ABS_1 ABS_2 ABS_1B ABS_2B RN_SIMPLE RN_SIMPLE RN_1B RN_2B
30 20 10 -10 10 10 10.01 9.99 1 1 1 1
30 10 20 -20 20 20 20.01 19.99 2 2 2 2
30 0 30 -30 30 30 30.01 29.99 3 3 3 3
20 20 0 0 0 0 0.01 0.01 1 1 1 1
20 10 10 -10 10 10 10.01 9.99 2 2 2 2
20 0 20 -20 20 20 20.01 19.99 3 3 3 3
10 20 -10 10 10 10 9.99 10.01 2 2 2 3
10 10 0 0 0 0 0.01 0.01 1 1 1 1
10 0 10 -10 10 10 10.01 9.99 3 3 3 2
5 20 -15 15 15 15 14.99 15.01 3 3 3 3
5 10 -5 5 5 5 4.99 5.01 1 1 1 2
5 0 5 -5 5 5 5.01 4.99 2 2 2 1

we see that to get the sort order you implied we should use RN_2B which is abs(b.col_a - a.col_a 0.01)

So now with:

SELECT a.*, b.*
FROM table_a as a
CROSS JOIN table_b as b
QUALIFY row_number() over(partition by a.col_a order by abs(b.col_a - a.col_a 0.01))= 1
ORDER BY a.col_a desc;

we get

COL_A COL_B COL_C COL_A COL_B COL_C
30 One Three 20 Two Three
20 One Two 20 Two Three
10 One One 10 Two One
5 One Five 0 Two Zero

Now lets talk about why that is bad

So if you have 100K rows this is ugly and you very much want to avoid this. The first thing is way to put an equi-join (a.x = b. y) on a JOIN/ON instead of the CROSS JOIN to limit the permutaions.

If you literally cannot do this you still for do the equi join in one stage, and then fill in the gaps with a LEFT JOIN

like this:

SELECT a.*,
    nvl(b.col_a, c.col_a) as col_a,
    nvl(b.col_b, c.col_b) as col_b,
    nvl(b.col_c, c.col_c) as col_c
FROM table_a as a
LEFT JOIN table_b as b ON a.col_a = b.col_a
LEFT JOIN table_b as c ON b.col_a is null
QUALIFY row_number() over(partition by a.col_a order by abs(c.col_a - a.col_a 0.01))= 1
ORDER BY a.col_a desc

hybrid:

mixing David's LAG/LEAD idea with my double join:

SELECT a.col_a, b.col_a, c.col_a
FROM table_a as a
LEFT JOIN table_b as b ON a.col_a = b.col_a
LEFT JOIN (
    SELECT *,
        LAG(col_a) over(order by col_a) as p_col_a,
        LEAD(col_a) over(order by col_a) as n_col_a
    FROM table_b
) as c ON b.col_a is null AND a.col_a between nvl(c.p_col_a, a.col_a) and nvl(n_col_a, a.col_a)
ORDER BY a.col_a desc

gives:

COL_A COL_A COL_A
30 20
20 20
10 10
5 0
5 10

shows how 10, 20 are match, 5 is pair with 2 values (1 above, 1 below) and 30 is only paired with one value below.

Then putting the QUALIFY on that:

SELECT a.*,
   nvl(b.col_a, c.col_a) as col_a,
   nvl(b.col_b, c.col_b) as col_b,
   nvl(b.col_c, c.col_c) as col_c
FROM table_a as a
LEFT JOIN table_b as b ON a.col_a = b.col_a
LEFT JOIN (
    SELECT *,
        LAG(col_a) over(order by col_a) as p_col_a,
        LEAD(col_a) over(order by col_a) as n_col_a
    FROM table_b
) as c ON b.col_a is null AND a.col_a between nvl(c.p_col_a, a.col_a) and nvl(n_col_a, a.col_a)
QUALIFY row_number() over(partition by a.col_a order by abs(c.col_a - a.col_a 0.01))= 1
ORDER BY a.col_a desc;

gives the expected results again, with the least amount of cross joins, row push through the row_number()

CodePudding user response:

Here's how to join records from TABLE_A to the same ID column OR the next smallest ID column. This is a slightly rephrased question, but it's a much simpler answer.

with cte_b as (
    select
    COL_A, 
    lead(COL_A) over (order by COL_A) b.next_col_a
    COL_B,
    COL_C
    from table_b
)

SELECT
    a.COL_A,
    a.COL_B,
    a.COL_C,
    b.COL_A,
    b.COL_B,
    b.COL_C,
from table a
join cte_b b
    on a.COL_A >= b.COL_A and a.COL_A < b.next_col_a

This is simpler because you don't have to do the math on which ID is "closest" to a given ID, which can depend on the distribution of your IDs in the table.


If you REALLY NEED "closest" then you'll need to also include the lag as well as the lead.

lag(COL_A) over(order by COL_A) b.prev_col_a

and then implement your own logic on how to join. You'd need to do some more testing with your data, and decide how you want to handle edge cases (like the largest/smallest ID in your table)

I think you could find the averages between your IDs and it could look something like this.

ON a.COL_A >= ((prev_col_a   col_a) / 2) and a.COL_A < ((col_a   next_col_a)/ 2)

I do feel like there might be a more elegant answer out there on what the join condition should look like, but I haven't found one from a little bit of searching.

  • Related