The real problem is with some huge product catalogs. In order to learn how to SQL this properly, I'm testing with this dataset. The jumbled up order of things is intended to help ensure that the logic will be sound.
The following is a listing of the test data.
select * from test;
id | title | time |
---|---|---|
234 | 234_3 | 3 |
999 | 999_11 | 11 |
123 | 123_4 | 4 |
234 | 234_9 | 9 |
123 | 123_2.3 | 2.3 |
999 | 999_1.2 | 1.2 |
999 | 999_5 | 5 |
99 | 99_7 | 7 |
999 | 999_6 | 6 |
10023 | shoe | 3 |
10023 | shoe | 4 |
10023 | NOT A SHOE | 5 |
10023 | NOT A SHOE | 6 |
10024 | shirt | 2 |
10024 | garbage | 3 |
10024 | shirt | 4 |
234 | 234_1 | 1 |
As you can see the title here are partially named to help correlate with the other values.
The desired output is the following:
id | t_start | t_finish | title_start | title_finish |
---|---|---|---|---|
99 | 7 | 7 | 99_7 | 99_7 |
123 | 2.3 | 4 | 123_2.3 | 123_4 |
234 | 1 | 9 | 234_1 | 234_9 |
999 | 1.2 | 11 | 999_1.2 | 999_11 |
10023 | 3 | 6 | shoe | NOT A SHOE |
10024 | 2 | 4 | shirt | shirt |
At first I tried using window functions. I started with this to understand how the window function works by using an aggregate function to display the contents of the "window".
select string_agg(title, ', ') over (partition by id order by time) from test;
string_agg |
---|
99_7 |
123_2.3 |
123_2.3, 123_4 |
234_1 |
234_1, 234_3 |
234_1, 234_3, 234_9 |
999_1.2 |
999_1.2, 999_5 |
999_1.2, 999_5, 999_6 |
999_1.2, 999_5, 999_6, 999_11 |
shoe |
shoe, shoe |
shoe, shoe, NOT A SHOE |
shoe, shoe, NOT A SHOE, NOT A SHOE |
shirt |
shirt, garbage |
shirt, garbage, shirt |
This indicates the desired application of grouping by id and ordering by time. I have gone into this result table and bolded the information that I actually need.
Unfortunately I am at a complete loss as to how to:
- only choose the last of these partition windows
- choose only the first and last items
Furthermore:
- prefer avoiding stitching strings only to split them back out: real titles can have commas, or indeed any separator I might otherwise end up choosing, inside.
A second approach I went ahead and took is the below:
select
d.id,
d.title as title_start,
test.title as title_finish,
start,
finish
from
(
select
c.id,
title,
start,
finish
from (
(
select min(time) as start, id from test group by id
) a
inner join
(
select max(time) as finish, id from test group by id
) b
using (id)
) c
left join test
on c.id = test.id
where c.start = test.time
) d
left join test
on d.id = test.id
and d.finish = test.time
order by id;
id | title_start | title_finish | start | finish |
---|---|---|---|---|
99 | 99_7 | 99_7 | 7 | 7 |
123 | 123_2.3 | 123_4 | 2.3 | 4 |
234 | 234_1 | 234_9 | 1 | 9 |
999 | 999_1.2 | 999_11 | 1.2 | 11 |
10023 | shoe | NOT A SHOE | 3 | 6 |
10024 | shirt | shirt | 2 | 4 |
Here I was able to get all the way there to my desired result by brute forcing joins. But this SQL code is extremely difficult to follow. If I saw this presented to me as a solution I would feel pretty strongly about rejecting it. I would really like it if a SQL expert could review this and tell me the actual right way to code up this problem. Surely three joins are not needed?
CodePudding user response:
Your SQL based solution was actually pretty good. I reworked it only for readability and to make it a little easier to see where things are coming from and moving to.
select
start.id,
start.title as title_start,
finish.title as title_finish,
start.time as start,
finish.time as finish
from
(select test.id, title, sub.time from test
inner join (select min(time) as time, id from test group by id) sub
on test.id = sub.id and test.time = sub.time) start
inner join
(select test.id, title, sub.time from test
inner join (select max(time) as time, id from test group by id) sub
on test.id = sub.id and test.time = sub.time) finish
using(id)
order by start.id;
Here is the fiddle
CodePudding user response:
This is a gaps-and-islands problem., you can try to use row_number()
window functions, The trick point is to subtract row numbers so that consecutive rows fall in the same group (grp
) per (id
)
Query #2
WITH CTE AS (
SELECT id,
MIN (time) t_start,
MAX (time) t_finish
FROM (
SELECT *, ROW_NUMBER() OVER(ORDER BY id,time) - ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) grp
FROM test
) t1
GROUP BY id,grp
)
SELECT c.id,
c.t_start,
c.t_finish,
MAX(CASE WHEN c.t_start = t1.time THEN title END),
MAX(CASE WHEN c.t_finish = t1.time THEN title END)
FROM CTE c
INNER JOIN test t1
ON c.id = t1.id
GROUP BY c.id,
c.t_start,
c.t_finish
ORDER BY c.id;
id | t_start | t_finish | max | max |
---|---|---|---|---|
99 | 7 | 7 | 99_7 | 99_7 |
123 | 2.3 | 4 | 123_2.3 | 123_4 |
234 | 1 | 9 | 234_1 | 234_9 |
999 | 1.2 | 11 | 999_1.2 | 999_11 |
10023 | 3 | 6 | shoe | NOT A SHOE |
10024 | 2 | 4 | shirt | shirt |
Another way I think that will be simpler than the above answer, unnecessary self-join use two ROW_NUMBER
in CTE, one is for ORDER BY time
,
another one is for ORDER BY time DESC
, then use the aggregate condition function to get the result.
;WITH CTE AS (
SELECT time,
title,
id,
ROW_NUMBER() OVER(PARTITION BY id,grp ORDER BY time) first_rn,
ROW_NUMBER() OVER(PARTITION BY id,grp ORDER BY time DESC) last_rn
FROM (
SELECT *, ROW_NUMBER() OVER(ORDER BY id,time) - ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) grp
FROM test
) t1
)
SELECT id,
MAX(CASE WHEN first_rn = 1 THEN time END) t_start,
MAX(CASE WHEN last_rn = 1 THEN time END) t_finish,
MAX(CASE WHEN first_rn = 1 THEN title END) title_start,
MAX(CASE WHEN last_rn = 1 THEN title END) title_finish
FROM CTE
GROUP BY id
ORDER BY ID