Snowflake As of Join query-CodePudding

How to do as of query in snowflake. i.e similar to a left-join except that we match on nearest key rather than equal keys.

TABLE A:

COL_A	COL_B	COL_C
30	One	Three
20	One	Two
10	One	One
05	One	Five

TABLE B:

COL_A	COL_B	COL_C
20	Two	Three
10	Two	One
00	Two	Zero

Join TABLE A(Left) and TABLE B(Right) on Col_A and if there is a match then get the matched row, if there is no match then get the nearest row:

COL_A(Left)	COL_B(Left)	COL_C(Left)	COL_A (Right)	COL_B(Right)	COL_C(Right)
30	One	Three	20	Two	Three
20	One	Two	20	Two	Three
10	One	One	10	Two	Three
05	One	Five	00	Two	Zero

CodePudding user response：

so with some CTE's to provide the data:

WITH table_a(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (30, 'One','Three'),
        (20, 'One','Two'),
        (10, 'One','One'),
        (05, 'One','Five')
), table_b(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (20, 'Two','Three'),
        (10, 'Two','One'),
        (00, 'Two','Zero')
)

then a cross join with a QUALIFY to throw away the not nearests:

WITH table_a(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (30, 'One','Three'),
        (20, 'One','Two'),
        (10, 'One','One'),
        (05, 'One','Five')
), table_b(col_a, col_b, col_c) AS (
    SELECT * FROM VALUES
        (20, 'Two','Three'),
        (10, 'Two','One'),
        (00, 'Two','Zero')
)
SELECT 
    a.*, 
    b.*
FROM table_a as a
CROSS JOIN table_b as b
QUALIFY row_number() over(partition by a.col_a order by abs(a.col_a - b.col_a)) = 1
ORDER BY 1 desc;

gives unstables results, because 10 and 0 are the same distance from 5...

COL_A	COL_B	COL_C	COL_A	COL_B	COL_C
30	One	Three	20	Two	Three
20	One	Two	20	Two	Three
10	One	One	10	Two	One
5	One	Five	10	Two	One

Lets see why:

So given the abs(a-b) is equal, if we add a bias that is a tenth of the smallest value that will be in col_a col_b we can move the preference in one direction

SELECT a.col_a, b.col_a
    ,a.col_a - b.col_a as sub_1
    ,b.col_a - a.col_a as sub_2
    ,abs(sub_1) as abs_1
    ,abs(sub_2) as abs_2
    ,abs(sub_1 0.01) as abs_1b
    ,abs(sub_2 0.01) as abs_2b
    ,row_number() over(partition by a.col_a order by abs_1) as rn_1_simple
    ,row_number() over(partition by a.col_a order by abs_2) as rn_2_simple
    ,row_number() over(partition by a.col_a order by abs_1b) as rn_1b
    ,row_number() over(partition by a.col_a order by abs_2b) as rn_2b
FROM table_a as a
CROSS JOIN table_b as b
ORDER BY a.col_a desc, b.col_a desc

gives a lovely table:

COL_A	COL_A	SUB_1	SUB_2	ABS_1	ABS_2	ABS_1B	ABS_2B	RN_SIMPLE	RN_SIMPLE	RN_1B	RN_2B
30	20	10	-10	10	10	10.01	9.99	1	1	1	1
30	10	20	-20	20	20	20.01	19.99	2	2	2	2
30	0	30	-30	30	30	30.01	29.99	3	3	3	3
20	20	0	0	0	0	0.01	0.01	1	1	1	1
20	10	10	-10	10	10	10.01	9.99	2	2	2	2
20	0	20	-20	20	20	20.01	19.99	3	3	3	3
10	20	-10	10	10	10	9.99	10.01	2	2	2	3
10	10	0	0	0	0	0.01	0.01	1	1	1	1
10	0	10	-10	10	10	10.01	9.99	3	3	3	2
5	20	-15	15	15	15	14.99	15.01	3	3	3	3
5	10	-5	5	5	5	4.99	5.01	1	1	1	2
5	0	5	-5	5	5	5.01	4.99	2	2	2	1

we see that to get the sort order you implied we should use RN_2B which is abs(b.col_a - a.col_a 0.01)

So now with:

SELECT a.*, b.*
FROM table_a as a
CROSS JOIN table_b as b
QUALIFY row_number() over(partition by a.col_a order by abs(b.col_a - a.col_a 0.01))= 1
ORDER BY a.col_a desc;

we get

COL_A	COL_B	COL_C	COL_A	COL_B	COL_C
30	One	Three	20	Two	Three
20	One	Two	20	Two	Three
10	One	One	10	Two	One
5	One	Five	0	Two	Zero

Now lets talk about why that is bad

So if you have 100K rows this is ugly and you very much want to avoid this. The first thing is way to put an equi-join (a.x = b. y) on a JOIN/ON instead of the CROSS JOIN to limit the permutaions.

If you literally cannot do this you still for do the equi join in one stage, and then fill in the gaps with a LEFT JOIN

like this:

SELECT a.*,
    nvl(b.col_a, c.col_a) as col_a,
    nvl(b.col_b, c.col_b) as col_b,
    nvl(b.col_c, c.col_c) as col_c
FROM table_a as a
LEFT JOIN table_b as b ON a.col_a = b.col_a
LEFT JOIN table_b as c ON b.col_a is null
QUALIFY row_number() over(partition by a.col_a order by abs(c.col_a - a.col_a 0.01))= 1
ORDER BY a.col_a desc

hybrid:

mixing David's LAG/LEAD idea with my double join:

SELECT a.col_a, b.col_a, c.col_a
FROM table_a as a
LEFT JOIN table_b as b ON a.col_a = b.col_a
LEFT JOIN (
    SELECT *,
        LAG(col_a) over(order by col_a) as p_col_a,
        LEAD(col_a) over(order by col_a) as n_col_a
    FROM table_b
) as c ON b.col_a is null AND a.col_a between nvl(c.p_col_a, a.col_a) and nvl(n_col_a, a.col_a)
ORDER BY a.col_a desc

gives:

COL_A	COL_A	COL_A
30		20
20	20
10	10
5		0
5		10

shows how 10, 20 are match, 5 is pair with 2 values (1 above, 1 below) and 30 is only paired with one value below.

Then putting the QUALIFY on that:

SELECT a.*,
   nvl(b.col_a, c.col_a) as col_a,
   nvl(b.col_b, c.col_b) as col_b,
   nvl(b.col_c, c.col_c) as col_c
FROM table_a as a
LEFT JOIN table_b as b ON a.col_a = b.col_a
LEFT JOIN (
    SELECT *,
        LAG(col_a) over(order by col_a) as p_col_a,
        LEAD(col_a) over(order by col_a) as n_col_a
    FROM table_b
) as c ON b.col_a is null AND a.col_a between nvl(c.p_col_a, a.col_a) and nvl(n_col_a, a.col_a)
QUALIFY row_number() over(partition by a.col_a order by abs(c.col_a - a.col_a 0.01))= 1
ORDER BY a.col_a desc;

gives the expected results again, with the least amount of cross joins, row push through the row_number()

CodePudding user response：

Here's how to join records from TABLE_A to the same ID column OR the next smallest ID column. This is a slightly rephrased question, but it's a much simpler answer.

with cte_b as (
    select
    COL_A, 
    lead(COL_A) over (order by COL_A) b.next_col_a
    COL_B,
    COL_C
    from table_b
)

SELECT
    a.COL_A,
    a.COL_B,
    a.COL_C,
    b.COL_A,
    b.COL_B,
    b.COL_C,
from table a
join cte_b b
    on a.COL_A >= b.COL_A and a.COL_A < b.next_col_a

This is simpler because you don't have to do the math on which ID is "closest" to a given ID, which can depend on the distribution of your IDs in the table.

If you REALLY NEED "closest" then you'll need to also include the lag as well as the lead.

lag(COL_A) over(order by COL_A) b.prev_col_a

and then implement your own logic on how to join. You'd need to do some more testing with your data, and decide how you want to handle edge cases (like the largest/smallest ID in your table)

I think you could find the averages between your IDs and it could look something like this.

ON a.COL_A >= ((prev_col_a   col_a) / 2) and a.COL_A < ((col_a   next_col_a)/ 2)

I do feel like there might be a more elegant answer out there on what the join condition should look like, but I haven't found one from a little bit of searching.