Ok, so I've done a lot of digging and trying all sorts of snippets and solutions for other people's queries, but I just can't seem to get any traction here. Also, only started trying sql 2 weeks ago, so pardon the beginner...
I have an ITEMS table with ITEM_ID and GROUP_ID, and GRP table with GROUP_ID and GROUP_NAME
Sometimes there are missing ITEM_ID's within a min and max (not all groups start at 1 unfortunately) and they are VARCHAR
I want to be able to search by GROUP_NAME for missing ITEM_ID's
I've been trying to use:
with nums as (
SELECT
seq4() + 1 as id
from table(generator(rowcount => 10))
)
SELECT nums.id
FROM nums
To get a list of 1 to however many rows, but need this to be a min and max for what the ITEM_ID is. I tried to put in a CAST or even just a min(item.ITEM_ID) but am running into issues because the ITEMS table is already huge, hence the sorting by GROUP_NAME
I'm sorry I don't have a lot of code examples, because I mostly can't get anything to run. But here is how I have been joining the ITEMS and GRP tables to search by name:
SELECT
item.NAME, item.ITEM_ID, grp.NAME
FROM
API_ITEMS item
LEFT JOIN API_GROUPS grp ON item.group_id = grp.id
WHERE
grp.NAME = 'silverware'
ORDER BY item.ITEM_ID
I figure a lot of this is just because I'm super green with sql, but basically I'm trying to search GROUP_NAME of ITEMS to find missing ITEM_IDS
In my API_ITEMS table I have all items of all groups, I want to filter by just the grp.NAME, and find which ITEM_IDs are missing from the min(item.ITEM_ID), max(item.ITEM_ID). Not sure if I need to CAST as INT for the min/max, or if varchar works here just the same. I was thinking if I could join it to a seq4() that goes from min -> max, i could find null entries where there is a seq4() entry but no ITEM_ID entry
1 Example of this is a joined table I have with
ITEM_ID | grp.NAME |
---|---|
1 | silverware |
2 | silverware |
3 | silverware |
4 | silverware |
5000 | silverware |
I've found a gap script that will output "gap starts at 5, ends at 4999", but that didn't work across all of my grp.NAMEs.
In this example, what my thoughts are is that I can create a seq4() as nums table of 5000 items. Join the item table on item.ITEM_ID = nums.ID to have a
nums.ID | ITEM_ID |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | null |
6 | null |
... | ... |
5000 | 5000 |
I'd like an output listing each ITEM_ID that is missing, e.g.
MISSING_ITEM_ID
5
6
7
...
4999
Thanks to anyone that's willing to pick this up, hold my hand, and help me out here.
I'm on snowflake, I don't have access to Tally tables...
CodePudding user response:
So following your example, I have made some CTEs (these could be tables instead; they are interchangeable here), and have chosen smaller number ranges just so the output is not too long:
with items(item_id, item_name, group_id ) as (
select * from values
(1, 'item1', 100),
(2, 'item2', 100),
(3, 'item3', 100),
(4, 'item4', 100),
(9, 'item31', 100),
(1, 'item1-101', 101)
), groups(group_id, name) as (
select * from values
(100, 'silverware'),
(101, 'leadware')
)
and joining those to find the 'silverware' items, like you have:
select i.item_id,
i.item_name,
g.name
from items as i
join groups as g
on i.group_id = g.group_id
where g.name = 'silverware'
gives:
ITEM_ID | ITEM_NAME | NAME |
---|---|---|
1 | item1 | silverware |
2 | item2 | silverware |
3 | item3 | silverware |
4 | item4 | silverware |
9 | item31 | silverware |
Now you want a CTE with a range of numbers. For this you should use the ROW_NUMBER function rather than SEQx(), because SEQx() can have gaps in its output. If you just want distinct numbers that go upwards, SEQ is faster, but if you need a gap-free sequence, SEQ will burn you randomly.
So let's swap that nums CTE to ROW_NUMBER, and pick a large number that is bigger than any range you will ever have. In the real world I would make a table that has 1 million rows in it, and then just use that.
with nums as (
SELECT
row_number()over(order by null) as id
from table(generator(rowcount => 5000))
)
so now we have our large collection of ids in nums, and our wanted data.
That leaves two things to do:
- join the range to our data, and prune the out-of-range values
- find the gaps
so with these in place:
with nums as (
SELECT
row_number()over(order by null) as id
from table(generator(rowcount => 5000))
), wanted_data as (
select i.item_id,
i.item_name,
i.group_id,
g.name
from items as i
join groups as g
on i.group_id = g.group_id
where g.name = 'silverware'
), ranges as (
select group_id,
min(item_id) as min_id,
max(item_id) as max_id
from wanted_data
group by 1
)
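A side note: since your real ITEM_ID is VARCHAR, you will want to cast it to a number in this step, otherwise min/max and the BETWEEN join compare strings (where '10' < '9'). A sketch using TRY_TO_NUMBER, which returns NULL for anything non-numeric, assuming your real IDs are all numeric strings:

```sql
), ranges as (
    select group_id,
        min(try_to_number(item_id)) as min_id,
        max(try_to_number(item_id)) as max_id
    from wanted_data
    group by 1
)
```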
this shows the range join, and the first half of the left join
select
n.id,
r.group_id,
w.*
from nums as n
join ranges as r
on n.id between r.min_id and r.max_id
left join wanted_data as w
on w.item_id = n.id
ID | GROUP_ID | ITEM_ID | ITEM_NAME | GROUP_ID_2 | NAME |
---|---|---|---|---|---|
1 | 100 | 1 | item1 | 100 | silverware |
2 | 100 | 2 | item2 | 100 | silverware |
3 | 100 | 3 | item3 | 100 | silverware |
4 | 100 | 4 | item4 | 100 | silverware |
5 | 100 | null | null | null | null |
6 | 100 | null | null | null | null |
7 | 100 | null | null | null | null |
8 | 100 | null | null | null | null |
9 | 100 | 9 | item31 | 100 | silverware |
now we can filter out the missed left joins, by adding to the end:
where w.item_id is null
ID | GROUP_ID | ITEM_ID | ITEM_NAME | GROUP_ID_2 | NAME |
---|---|---|---|---|---|
5 | 100 | null | null | null | null |
6 | 100 | null | null | null | null |
7 | 100 | null | null | null | null |
8 | 100 | null | null | null | null |
so we don't need to select those failed-join w.* values, but we are currently missing the group name. You can either re-join to the groups table at this point (which is how I tend to do things in large workflows, to avoid carrying too many values through the transformation stages), or just carry it.
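The re-join option would look something like this sketch (joining groups back on at the end instead of carrying name through ranges):

```sql
select
    n.id,
    r.group_id,
    g.name as group_name
from nums as n
join ranges as r
    on n.id between r.min_id and r.max_id
left join wanted_data as w
    on w.item_id = n.id
join groups as g
    on g.group_id = r.group_id
where w.item_id is null
```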
thus all the SQL together:
with items(item_id, item_name, group_id ) as (
select * from values
(1, 'item1', 100),
(2, 'item2', 100),
(3, 'item3', 100),
(4, 'item4', 100),
(9, 'item31', 100),
(1, 'item1-101', 101)
), groups(group_id, name) as (
select * from values
(100, 'silverware'),
(101, 'leadware')
), nums as (
SELECT
row_number()over(order by null) as id
from table(generator(rowcount => 5000))
), wanted_data as (
select i.item_id,
i.item_name,
i.group_id,
g.name
from items as i
join groups as g
on i.group_id = g.group_id
where g.name = 'silverware'
), ranges as (
select group_id,
name as group_name,
min(item_id) as min_id,
max(item_id) as max_id
from wanted_data
group by 1,2
)
select
n.id,
r.group_id,
r.group_name
from nums as n
join ranges as r
on n.id between r.min_id and r.max_id
left join wanted_data as w
on w.item_id = n.id
where w.item_id is null
gives:
ID | GROUP_ID | GROUP_NAME |
---|---|---|
5 | 100 | silverware |
6 | 100 | silverware |
7 | 100 | silverware |
8 | 100 | silverware |
The other range way:
now if the numeric space of your IDs is very large, having a huge nums table can be gross.
with items(item_id, item_name, group_id ) as (
select * from values
(1000001, 'item1', 100),
(1000002, 'item2', 100),
(1000003, 'item3', 100),
(1000004, 'item4', 100),
(1000009, 'item31', 100),
(2000001, 'item1-101', 101)
)
then with a change to nums so it generates 0-based ids:
), nums as (
SELECT
row_number()over(order by null)-1 as id
from table(generator(rowcount => 5000))
)
you can use a smaller range table, and just use relative math (note: the ranges CTE also needs a span column here, e.g. max_id - min_id as span):
select
    n.id + r.min_id as id,
    r.group_id,
    r.group_name
from nums as n
join ranges as r
    on n.id <= r.span
left join wanted_data as w
    on w.item_id = n.id + r.min_id
where w.item_id is null
to get:
ID | GROUP_ID | GROUP_NAME |
---|---|---|
1000005 | 100 | silverware |
1000006 | 100 | silverware |
1000007 | 100 | silverware |
1000008 | 100 | silverware |
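For completeness, the ranges CTE used by this relative-math variant needs the span column, which the earlier CTE did not define — a sketch:

```sql
), ranges as (
    select group_id,
        name as group_name,
        min(item_id) as min_id,
        max(item_id) as max_id,
        max(item_id) - min(item_id) as span
    from wanted_data
    group by 1, 2
)
```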
CodePudding user response:
Not sure if this is exactly what you are looking for, but you can use the UNIFORM function to provide a min and max threshold for the RANDOM function. In this example you will get a MIN of 1 and MAX of 10 for each of the 10 rows created.
SELECT
uniform(1, 10, random(1)) as ITEM_ID
from table(generator(rowcount=>10));
Uniform documentation: https://docs.snowflake.com/en/sql-reference/functions/uniform.html