How to pick the first and last from a given ordering when partitioning items into groups?-CodePudding

The real problem is with some huge product catalogs. In order to learn how to SQL this properly, I'm testing with this dataset. The jumbled up order of things is intended to help ensure that the logic will be sound.

The following is a listing of the test data.

select * from test;

id	title	time
234	234_3	3
999	999_11	11
123	123_4	4
234	234_9	9
123	123_2.3	2.3
999	999_1.2	1.2
999	999_5	5
99	99_7	7
999	999_6	6
10023	shoe	3
10023	shoe	4
10023	NOT A SHOE	5
10023	NOT A SHOE	6
10024	shirt	2
10024	garbage	3
10024	shirt	4
234	234_1	1

As you can see the title here are partially named to help correlate with the other values.

The desired output is the following:

id	t_start	t_finish	title_start	title_finish
99	7	7	99_7	99_7
123	2.3	4	123_2.3	123_4
234	1	9	234_1	234_9
999	1.2	11	999_1.2	999_11
10023	3	6	shoe	NOT A SHOE
10024	2	4	shirt	shirt

At first I tried using window functions. I started with this to understand how the window function works by using an aggregate function to display the contents of the "window".

select string_agg(title, ', ') over (partition by id order by time) from test;

string_agg
99_7
123_2.3
123_2.3, 123_4
234_1
234_1, 234_3
234_1, 234_3, 234_9
999_1.2
999_1.2, 999_5
999_1.2, 999_5, 999_6
999_1.2, 999_5, 999_6, 999_11
shoe
shoe, shoe
shoe, shoe, NOT A SHOE
shoe, shoe, NOT A SHOE, NOT A SHOE
shirt
shirt, garbage
shirt, garbage, shirt

This indicates the desired application of grouping by id and ordering by time. I have gone into this result table and bolded the information that I actually need.

Unfortunately I am at a complete loss as to how to:

only choose the last of these partition windows
choose only the first and last items

Furthermore:

prefer avoiding stitching strings only to split them back out: real titles can have commas, or indeed any separator I might otherwise end up choosing, inside.

A second approach I went ahead and took is the below:

select 
    d.id,
    d.title as title_start,
    test.title as title_finish,
    start,
    finish
from
(
  select 
    c.id,
    title,
    start,
    finish
  from (
    (
        select min(time) as start, id from test group by id
    ) a
    inner join
    (
        select max(time) as finish, id from test group by id
    ) b 
    using (id)
  ) c
  left join test
  on c.id = test.id
  where c.start = test.time
) d
left join test
on d.id = test.id
and d.finish = test.time
order by id;

id	title_start	title_finish	start	finish
99	99_7	99_7	7	7
123	123_2.3	123_4	2.3	4
234	234_1	234_9	1	9
999	999_1.2	999_11	1.2	11
10023	shoe	NOT A SHOE	3	6
10024	shirt	shirt	2	4

View on DB Fiddle

Here I was able to get all the way there to my desired result by brute forcing joins. But this SQL code is extremely difficult to follow. If I saw this presented to me as a solution I would feel pretty strongly about rejecting it. I would really like it if a SQL expert could review this and tell me the actual right way to code up this problem. Surely three joins are not needed?

CodePudding user response：

Your SQL based solution was actually pretty good. I reworked it only for readability and to make it a little easier to see where things are coming from and moving to.

select 
    start.id,
    start.title as title_start,
    finish.title as title_finish,
    start.time as start,
    finish.time as finish
from 
    (select test.id, title, sub.time from test
        inner join (select min(time) as time, id from test group by id) sub
            on test.id = sub.id and test.time = sub.time) start
inner join
    (select test.id, title, sub.time from test
        inner join (select max(time) as time, id from test group by id) sub
            on test.id = sub.id and test.time = sub.time) finish
    using(id)
    order by start.id;

Here is the fiddle

CodePudding user response：

This is a gaps-and-islands problem., you can try to use row_number() window functions, The trick point is to subtract row numbers so that consecutive rows fall in the same group (grp) per (id)

Query #2

WITH CTE AS (
    SELECT id,
           MIN (time) t_start,
           MAX (time) t_finish
    FROM (
        SELECT *, ROW_NUMBER() OVER(ORDER BY id,time) - ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) grp
        FROM test
    ) t1
    GROUP BY id,grp
)
SELECT c.id,
       c.t_start,
       c.t_finish,
       MAX(CASE WHEN c.t_start = t1.time THEN title END),
       MAX(CASE WHEN c.t_finish = t1.time THEN title END)
FROM CTE c 
INNER JOIN test t1 
ON c.id = t1.id
GROUP BY c.id,
       c.t_start,
       c.t_finish
ORDER BY c.id;

id	t_start	t_finish	max	max
99	7	7	99_7	99_7
123	2.3	4	123_2.3	123_4
234	1	9	234_1	234_9
999	1.2	11	999_1.2	999_11
10023	3	6	shoe	NOT A SHOE
10024	2	4	shirt	shirt

Another way I think that will be simpler than the above answer, unnecessary self-join use two ROW_NUMBER in CTE, one is for ORDER BY time, another one is for ORDER BY time DESC, then use the aggregate condition function to get the result.

;WITH CTE AS (
    SELECT time,
           title,
           id,
           ROW_NUMBER() OVER(PARTITION BY id,grp ORDER BY time) first_rn,
           ROW_NUMBER() OVER(PARTITION BY id,grp ORDER BY time DESC) last_rn
    FROM (
        SELECT *, ROW_NUMBER() OVER(ORDER BY id,time) - ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) grp
        FROM test
    ) t1
)
SELECT id,
       MAX(CASE WHEN first_rn = 1 THEN time END) t_start,
       MAX(CASE WHEN last_rn = 1  THEN time END) t_finish,
       MAX(CASE WHEN first_rn = 1 THEN title END) title_start,
       MAX(CASE WHEN last_rn = 1 THEN title END) title_finish
FROM CTE 
GROUP BY id
ORDER BY ID

View on DB Fiddle