Snowflake SQL - Find missing values within a variable VARCHAR range, with multiple table join


Ok, so I've done a lot of digging and tried all sorts of snippets and solutions from other people's questions, but I just can't seem to get any traction here. Also, I only started trying SQL 2 weeks ago, so pardon the beginner questions...

I have an ITEMS table with ITEM_ID and GROUP_ID, and a GRP table with GROUP_ID and GROUP_NAME.

Sometimes there are missing ITEM_IDs between the min and max (not all groups start at 1, unfortunately), and the IDs are VARCHAR.

I want to be able to search by GROUP_NAME for missing ITEM_IDs.

I've been trying to use:

with nums as (
    SELECT
        seq4() + 1 as id
    from table(generator(rowcount => 10))
)

SELECT nums.id from nums

to get a list from 1 to however many rows, but I need the range to run from the min to the max of ITEM_ID. I tried putting in a CAST, or even just min(item.ITEM_ID), but I'm running into issues because the ITEMS table is already huge, hence the filtering by GROUP_NAME.

I'm sorry I don't have a lot of code examples, because I mostly can't get anything to run. But here is how I have been joining the ITEMS and GRP tables to search by name:

SELECT
    item.NAME, item.ITEM_ID, grp.NAME
FROM
    API_ITEMS item
    LEFT JOIN API_GROUPS grp ON item.group_id = grp.id
WHERE
    grp.NAME = 'silverware'
ORDER BY item.ITEM_ID ASC

I figure a lot of this is just because I'm super green with SQL, but basically I'm trying to search ITEMS by GROUP_NAME to find missing ITEM_IDs.

In my API_ITEMS table I have the items of all groups. I want to filter by just the grp.NAME, and find which ITEM_IDs are missing between min(item.ITEM_ID) and max(item.ITEM_ID). Not sure if I need to CAST as INT for the min/max, or if VARCHAR works here just the same. I was thinking that if I could join it to a seq4() that runs from min to max, I could find the gaps where there is a seq4() entry but no matching ITEM_ID entry.
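(For what it's worth, min/max on a VARCHAR column compares lexicographically, so '10' sorts before '9'. Casting to a number first avoids that; TRY_CAST returns NULL instead of erroring on non-numeric strings. A minimal sketch, assuming the ids hold numeric strings:

```sql
-- string order vs numeric order on VARCHAR ids
select
    max(v)                      as max_varchar,  -- '9': string comparison
    max(try_cast(v as integer)) as max_int       -- 10: numeric comparison
from (select column1 as v from values ('9'), ('10'));
```

So the min/max, and the join to the number range, should be done on the casted value.)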

One example of this is a joined table I have with:

ITEM_ID   grp.NAME
1         silverware
2         silverware
3         silverware
4         silverware
5000      silverware

I've found a gap script that will output "gap starts at 5, ends at 4999", but that didn't work across all of my grp.NAMEs.

In this example, my thought is that I can create a seq4() nums table of 5000 rows, and join the item table on item.ITEM_ID = nums.ID to get:

nums.ID      ITEM_ID
1            1
2            2
3            3
4            4
5            null
6            null
...
5000         5000

I'd like an output listing each ITEM_ID that is missing, e.g.

MISSING_ITEM_ID
5
6
7
...
4999

Thanks to anyone that's willing to pick this up, hold my hand, and help me out here.

I'm on Snowflake, and I don't have access to tally tables...

CodePudding user response:

So following your example, I have made some CTEs (they could just as well be tables; they are interchangeable), and have chosen smaller number ranges so the output is not too long:

with items(item_id, item_name, group_id ) as (
    select * from values
    (1, 'item1', 100),
    (2, 'item2', 100),
    (3, 'item3', 100),
    (4, 'item4', 100),
    (9, 'item31', 100),
    (1, 'item1-101', 101)
), groups(group_id, name) as (
    select * from values
    (100, 'silverware'),
    (101, 'leadware')
)

and joining those to find the 'silverware' items, like you have:

select i.item_id,
    i.item_name,
    g.name
from items as i
join groups as g
    on i.group_id = g.group_id
where g.name = 'silverware'    

gives:

ITEM_ID ITEM_NAME NAME
1 item1 silverware
2 item2 silverware
3 item3 silverware
4 item4 silverware
9 item31 silverware

now you want a CTE with a range of numbers. For this you should use the ROW_NUMBER function, because SEQx() can have gaps in its output. If you just want distinct numbers that go upwards, SEQ is faster; but if you need a range with no holes, SEQ will burn you randomly.

So let's swap that nums CTE to ROW_NUMBER, and pick a number larger than any range you will ever have. In the real world I would make a table that has 1 million rows in it, and then just use that.
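That "make a real table once" idea is effectively the tally table mentioned in the question, built from the generator. A sketch, assuming the table name NUMS is free to use (run once, then reuse everywhere):

```sql
-- one-off tally table; replaces the nums CTE in the queries below
create table if not exists nums as
select row_number() over (order by null) as id
from table(generator(rowcount => 1000000));
```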

with nums as (
    SELECT 
        row_number()over(order by null) as id
    from table(generator(rowcount => 5000)) 
)

so now we have our large collection of ids in nums, and our wanted data.

  • so now we have two things to do: join the range to our data, and prune the out-of-range values
  • find the gaps

so with these in place:

with nums as (
    SELECT 
        row_number()over(order by null) as id
    from table(generator(rowcount => 5000)) 
), wanted_data as (
    select i.item_id,
        i.item_name,
        i.group_id,
        g.name
    from items as i
    join groups as g
        on i.group_id = g.group_id
    where g.name = 'silverware'   
), ranges as (
    select group_id,
        min(item_id) as min_id,
        max(item_id) as max_id
    from wanted_data
    group by 1
)

this shows the range join, and the first half of the left join

select 
    n.id,
    r.group_id,
    w.*
from nums as n
join ranges as r
    on n.id between r.min_id and r.max_id
left join wanted_data as w
    on w.item_id = n.id

gives:

ID GROUP_ID ITEM_ID ITEM_NAME GROUP_ID_2 NAME
1 100 1 item1 100 silverware
2 100 2 item2 100 silverware
3 100 3 item3 100 silverware
4 100 4 item4 100 silverware
5 100 null null null null
6 100 null null null null
7 100 null null null null
8 100 null null null null
9 100 9 item31 100 silverware

now we can filter out the missed left joins, by adding to the end:

where w.item_id is null

gives:

ID GROUP_ID ITEM_ID ITEM_NAME GROUP_ID_2 NAME
5 100 null null null null
6 100 null null null null
7 100 null null null null
8 100 null null null null

so we don't need to select those failed-join w.* values, but we are currently missing the group name. You can either re-join to the groups table at this point (which is how I tend to do things in large workflows, to avoid carrying too many values through the transformation stages), or just carry it.
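The re-join variant would look roughly like this, reusing the nums/ranges/wanted_data CTEs defined above and fetching the display name only at the end (a sketch):

```sql
select
    n.id   as missing_item_id,
    g.name as group_name
from nums as n
join ranges as r
    on n.id between r.min_id and r.max_id
left join wanted_data as w
    on w.item_id = n.id
join groups as g                 -- re-join just for the name
    on g.group_id = r.group_id
where w.item_id is null
```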

thus all the SQL together:

with items(item_id, item_name, group_id ) as (
    select * from values
    (1, 'item1', 100),
    (2, 'item2', 100),
    (3, 'item3', 100),
    (4, 'item4', 100),
    (9, 'item31', 100),
    (1, 'item1-101', 101)
), groups(group_id, name) as (
    select * from values
    (100, 'silverware'),
    (101, 'leadware')
), nums as (
    SELECT 
        row_number()over(order by null) as id
    from table(generator(rowcount => 5000)) 
), wanted_data as (
    select i.item_id,
        i.item_name,
        i.group_id,
        g.name
    from items as i
    join groups as g
        on i.group_id = g.group_id
    where g.name = 'silverware'   
), ranges as (
    select group_id,
        name as group_name,
        min(item_id) as min_id,
        max(item_id) as max_id
    from wanted_data
    group by 1,2
)
select 
    n.id,
    r.group_id,
    r.group_name
from nums as n
join ranges as r
    on n.id between r.min_id and r.max_id
left join wanted_data as w
    on w.item_id = n.id
where w.item_id is null

gives:

ID GROUP_ID GROUP_NAME
5 100 silverware
6 100 silverware
7 100 silverware
8 100 silverware
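Since the original question wanted this across every grp.NAME, the same query also reports gaps for all groups at once if the name filter is dropped: ranges already computes min/max per group, so each group gets its own range. A sketch of the one changed CTE (everything else stays as above):

```sql
wanted_data as (
    select i.item_id,
        i.item_name,
        i.group_id,
        g.name
    from items as i
    join groups as g
        on i.group_id = g.group_id
    -- no "where g.name = ..." filter: keep every group
)
```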

The other range way:

now, if the numeric space of your ids is very high, having a huge nums table can be gross.

with items(item_id, item_name, group_id ) as (
    select * from values
    (1000001, 'item1', 100),
    (1000002, 'item2', 100),
    (1000003, 'item3', 100),
    (1000004, 'item4', 100),
    (1000009, 'item31', 100),
    (2000001, 'item1-101', 101)
)

then with a change to nums so it generates ids starting from 0

), nums as (
    SELECT 
        row_number()over(order by null)-1 as id
    from table(generator(rowcount => 5000)) 
)

you can use a smaller range table, and just use relative math against each group's min_id:

select 
    n.id + r.min_id as id,
    r.group_id,
    r.group_name
from nums as n
join ranges as r
    on n.id <= r.max_id - r.min_id
left join wanted_data as w
    on w.item_id = n.id + r.min_id
where w.item_id is null

to get:

ID GROUP_ID GROUP_NAME
1000005 100 silverware
1000006 100 silverware
1000007 100 silverware
1000008 100 silverware

CodePudding user response:

Not sure if this is exactly what you are looking for, but you can use the UNIFORM function to provide min and max thresholds for the RANDOM function. In this example you will get a value between a MIN of 1 and a MAX of 10 for each of the 10 rows created.

 SELECT
   uniform(1, 10, random(1)) as ITEM_ID
from table(generator(rowcount=>10));

Uniform documentation: https://docs.snowflake.com/en/sql-reference/functions/uniform.html
