Home > Net >  How to pick the first and last from a given ordering when partitioning items into groups?
How to pick the first and last from a given ordering when partitioning items into groups?

Time:02-23

The real problem is with some huge product catalogs. In order to learn how to SQL this properly, I'm testing with this dataset. The jumbled up order of things is intended to help ensure that the logic will be sound.

The following is a listing of the test data.

select * from test;
id title time
234 234_3 3
999 999_11 11
123 123_4 4
234 234_9 9
123 123_2.3 2.3
999 999_1.2 1.2
999 999_5 5
99 99_7 7
999 999_6 6
10023 shoe 3
10023 shoe 4
10023 NOT A SHOE 5
10023 NOT A SHOE 6
10024 shirt 2
10024 garbage 3
10024 shirt 4
234 234_1 1

As you can see the title here are partially named to help correlate with the other values.

The desired output is the following:

id t_start t_finish title_start title_finish
99 7 7 99_7 99_7
123 2.3 4 123_2.3 123_4
234 1 9 234_1 234_9
999 1.2 11 999_1.2 999_11
10023 3 6 shoe NOT A SHOE
10024 2 4 shirt shirt

At first I tried using window functions. I started with this to understand how the window function works by using an aggregate function to display the contents of the "window".

select string_agg(title, ', ') over (partition by id order by time) from test;
string_agg
99_7
123_2.3
123_2.3, 123_4
234_1
234_1, 234_3
234_1, 234_3, 234_9
999_1.2
999_1.2, 999_5
999_1.2, 999_5, 999_6
999_1.2, 999_5, 999_6, 999_11
shoe
shoe, shoe
shoe, shoe, NOT A SHOE
shoe, shoe, NOT A SHOE, NOT A SHOE
shirt
shirt, garbage
shirt, garbage, shirt

This indicates the desired application of grouping by id and ordering by time. I have gone into this result table and bolded the information that I actually need.

Unfortunately I am at a complete loss as to how to:

  • only choose the last of these partition windows
  • choose only the first and last items

Furthermore:

  • prefer avoiding stitching strings only to split them back out: real titles can have commas, or indeed any separator I might otherwise end up choosing, inside.

A second approach I went ahead and took is the below:

select 
    d.id,
    d.title as title_start,
    test.title as title_finish,
    start,
    finish
from
(
  select 
    c.id,
    title,
    start,
    finish
  from (
    (
        select min(time) as start, id from test group by id
    ) a
    inner join
    (
        select max(time) as finish, id from test group by id
    ) b 
    using (id)
  ) c
  left join test
  on c.id = test.id
  where c.start = test.time
) d
left join test
on d.id = test.id
and d.finish = test.time
order by id;
id title_start title_finish start finish
99 99_7 99_7 7 7
123 123_2.3 123_4 2.3 4
234 234_1 234_9 1 9
999 999_1.2 999_11 1.2 11
10023 shoe NOT A SHOE 3 6
10024 shirt shirt 2 4

View on DB Fiddle

Here I was able to get all the way there to my desired result by brute forcing joins. But this SQL code is extremely difficult to follow. If I saw this presented to me as a solution I would feel pretty strongly about rejecting it. I would really like it if a SQL expert could review this and tell me the actual right way to code up this problem. Surely three joins are not needed?

CodePudding user response:

Your SQL based solution was actually pretty good. I reworked it only for readability and to make it a little easier to see where things are coming from and moving to.

select 
    start.id,
    start.title as title_start,
    finish.title as title_finish,
    start.time as start,
    finish.time as finish
from 
    (select test.id, title, sub.time from test
        inner join (select min(time) as time, id from test group by id) sub
            on test.id = sub.id and test.time = sub.time) start
inner join
    (select test.id, title, sub.time from test
        inner join (select max(time) as time, id from test group by id) sub
            on test.id = sub.id and test.time = sub.time) finish
    using(id)
    order by start.id;

Here is the fiddle

CodePudding user response:

This is a gaps-and-islands problem., you can try to use row_number() window functions, The trick point is to subtract row numbers so that consecutive rows fall in the same group (grp) per (id)

Query #2

WITH CTE AS (
    SELECT id,
           MIN (time) t_start,
           MAX (time) t_finish
    FROM (
        SELECT *, ROW_NUMBER() OVER(ORDER BY id,time) - ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) grp
        FROM test
    ) t1
    GROUP BY id,grp
)
SELECT c.id,
       c.t_start,
       c.t_finish,
       MAX(CASE WHEN c.t_start = t1.time THEN title END),
       MAX(CASE WHEN c.t_finish = t1.time THEN title END)
FROM CTE c 
INNER JOIN test t1 
ON c.id = t1.id
GROUP BY c.id,
       c.t_start,
       c.t_finish
ORDER BY c.id;
id t_start t_finish max max
99 7 7 99_7 99_7
123 2.3 4 123_2.3 123_4
234 1 9 234_1 234_9
999 1.2 11 999_1.2 999_11
10023 3 6 shoe NOT A SHOE
10024 2 4 shirt shirt

Another way I think that will be simpler than the above answer, unnecessary self-join use two ROW_NUMBER in CTE, one is for ORDER BY time, another one is for ORDER BY time DESC, then use the aggregate condition function to get the result.

;WITH CTE AS (
    SELECT time,
           title,
           id,
           ROW_NUMBER() OVER(PARTITION BY id,grp ORDER BY time) first_rn,
           ROW_NUMBER() OVER(PARTITION BY id,grp ORDER BY time DESC) last_rn
    FROM (
        SELECT *, ROW_NUMBER() OVER(ORDER BY id,time) - ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) grp
        FROM test
    ) t1
)
SELECT id,
       MAX(CASE WHEN first_rn = 1 THEN time END) t_start,
       MAX(CASE WHEN last_rn = 1  THEN time END) t_finish,
       MAX(CASE WHEN first_rn = 1 THEN title END) title_start,
       MAX(CASE WHEN last_rn = 1 THEN title END) title_finish
FROM CTE 
GROUP BY id
ORDER BY ID

View on DB Fiddle

  • Related