SQL group by selecting top rows with possible nulls

The example table:

id	name	create_time	group_id
1	a	2022-01-01 12:00:00	group1
2	b	2022-01-01 13:00:00	group1
3	c	2022-01-01 12:00:00	NULL
4	d	2022-01-01 13:00:00	NULL
5	e	NULL	group2

I need to get the top 1 row (the one with the minimal create_time) per group_id, under these conditions:

create_time can be null; a null should be treated as the minimal value
group_id can be null; all rows with a null group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something like that, assuming that ids are unique and never collide with group ids)
it should be possible to apply pagination to the query (so a join can be a problem)
the query should be as universal as possible (so no vendor-specific things). Again, if that's not possible, it should work in MySQL 5 & 8, PostgreSQL 9 and H2

The expected output for the example:

id	name	create_time	group_id
1	a	2022-01-01 12:00:00	group1
3	c	2022-01-01 12:00:00	NULL
4	d	2022-01-01 13:00:00	NULL
5	e	NULL	group2

I've already read similar questions on SO, but 90% of the answers rely on specific keywords (numerous answers use PARTITION BY, like https://stackoverflow.com/a/6841644/5572007), and the others don't honor null values in the grouping columns and probably don't support pagination (like https://stackoverflow.com/a/14346780/5572007).

CodePudding user response：

I would guess

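-- MIN, not MAX: the question asks for the earliest create_time per group.
-- Caveat: selecting id and name without aggregating them is rejected by
-- PostgreSQL and by MySQL with ONLY_FULL_GROUP_BY enabled.
SELECT id, name, MIN(create_time), group_id
FROM tb
WHERE group_id IS NOT NULL
GROUP BY group_id
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name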

I should point out that 'name' is a keyword in SQL (non-reserved in MySQL and PostgreSQL, so it can be used unquoted there).

CodePudding user response：

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)

Not sure how you imagine "pagination" should work. Here's one way:

and (
    select count(distinct coalesce(t2.group_id, t2.id)) from T t2
    where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)

I'm assuming there's an implicit cast from 0 to a date value with a resulting value lower than all those in your database. Not sure if that's reliable. (Try '19000101' instead?) Otherwise the rest should be universal. You could probably also parameterize that in the same way as the page range.
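To make the sentinel idea concrete, here is a minimal sketch that passes the "lower than everything" value in as a parameter instead of relying on an implicit cast from 0. It uses Python's built-in sqlite3 module only as a convenient stand-in engine, which is an assumption on my part (the question targets MySQL, PostgreSQL and H2); the :sentinel parameter name, the '1900-01-01 00:00:00' value and the text column types are likewise illustrative choices, not something from the original fiddle.

```python
import sqlite3

# Sample rows from the question: (id, name, create_time, group_id).
ROWS = [
    (1, "a", "2022-01-01 12:00:00", "group1"),
    (2, "b", "2022-01-01 13:00:00", "group1"),
    (3, "c", "2022-01-01 12:00:00", None),
    (4, "d", "2022-01-01 13:00:00", None),
    (5, "e", None, "group2"),
]

# Hypothetical sentinel: must sort before every real create_time.
SENTINEL = "1900-01-01 00:00:00"

# Same shape as the query above, with 0 replaced by a bound parameter.
QUERY = """
select * from T t1
where coalesce(create_time, :sentinel) = (
    select min(coalesce(create_time, :sentinel)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
order by t1.id
"""


def top_rows(rows, sentinel=SENTINEL):
    """Load rows into an in-memory table and return the top row per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table T ("
        "id integer primary key, name text, create_time text, group_id text)"
    )
    conn.executemany("insert into T values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY, {"sentinel": sentinel}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_rows(ROWS):
        print(row)
```

Timestamps are stored as text here, so the comparison is lexicographic; on a real date column the sentinel would need to be a proper date or timestamp value for that engine.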

You've also got a potential complication with collisions between the group_id and id spaces. Yours don't appear to have that problem, though having mixed data types creates its own issues.

This all gets more difficult when you want to order by other columns like name:

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
    select count(*) from (
        select * from T t1
        where coalesce(create_time, 0) = (
            select min(coalesce(create_time, 0)) from T t2
            where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
        )
    ) t3
    where t3.name < t1.name or (t3.name = t1.name
        and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id))
) between 2 and 5
order by t1.name;

That does handle ties, but it also makes the simplifying assumption that name can't be null, which would add yet another small twist. At least you can see that it's possible without CTEs and window functions, but expect these queries to be a lot less efficient to run.

https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691

CodePudding user response：

You can combine two queries with UNION ALL. E.g.:

select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
  select null
  from mytable older
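  where older.group_id = mytable.group_id
  -- a null create_time counts as older than any non-null one
  and (older.create_time < mytable.create_time
       or (older.create_time is null and mytable.create_time is not null))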
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;

This is standard SQL and very basic at that. It should work in about every RDBMS.
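As a quick sanity check, a lightly adapted copy of the query can be run against the question's sample rows. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption; it is not one of the databases the question names). Two things are my own additions: the outer table gets the alias cur, and the comparison includes an explicit null check so that a null create_time counts as the minimum.

```python
import sqlite3

# Sample rows from the question: (id, name, create_time, group_id).
ROWS = [
    (1, "a", "2022-01-01 12:00:00", "group1"),
    (2, "b", "2022-01-01 13:00:00", "group1"),
    (3, "c", "2022-01-01 12:00:00", None),
    (4, "d", "2022-01-01 13:00:00", None),
    (5, "e", None, "group2"),
]

QUERY = """
select id, name, create_time, group_id
from mytable cur
where group_id is not null
and not exists
(
  select null
  from mytable older
  where older.group_id = cur.group_id
  and (older.create_time < cur.create_time
       or (older.create_time is null and cur.create_time is not null))
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id
"""


def top_rows(rows):
    """Load rows into an in-memory table and run the UNION ALL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mytable ("
        "id integer primary key, name text, create_time text, group_id text)"
    )
    conn.executemany("insert into mytable values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_rows(ROWS):
        print(row)
```

Note that rows tied on the minimal create_time within one group are all returned; picking exactly one of them would need an extra tie-breaker, for example on id.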

As to pagination: This is usually costly, as you run the same query again and again in order to always pick the "next" part of the result, instead of running the query only once. The best approach is usually to use the primary key to get to the next part, so that an index on the key can be used. In the above query we'd ideally add where id > :last_biggest_id to both queries and limit the result, which would be fetch next <n> rows only in standard SQL. Every time we run the query, we use the last read ID as :last_biggest_id, so we read on from there.
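A minimal sketch of that key-based pagination, under some stated assumptions: Python's built-in sqlite3 module stands in for the real database, the filter parameter is called :last_id, and SQLite's LIMIT replaces the standard fetch clause. The null check on create_time and the cur alias are additions of mine so that the sample data behaves as the question expects.

```python
import sqlite3

# Sample rows from the question: (id, name, create_time, group_id).
ROWS = [
    (1, "a", "2022-01-01 12:00:00", "group1"),
    (2, "b", "2022-01-01 13:00:00", "group1"),
    (3, "c", "2022-01-01 12:00:00", None),
    (4, "d", "2022-01-01 13:00:00", None),
    (5, "e", None, "group2"),
]

# The keyset filter (id > :last_id) is added to both branches of the union.
PAGE_QUERY = """
select id, name, create_time, group_id
from mytable cur
where group_id is not null
and id > :last_id
and not exists
(
  select null
  from mytable older
  where older.group_id = cur.group_id
  and (older.create_time < cur.create_time
       or (older.create_time is null and cur.create_time is not null))
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
and id > :last_id
order by id
limit :page_size
"""


def make_db(rows=ROWS):
    """Create an in-memory database holding the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mytable ("
        "id integer primary key, name text, create_time text, group_id text)"
    )
    conn.executemany("insert into mytable values (?, ?, ?, ?)", rows)
    return conn


def iter_pages(conn, page_size):
    """Yield pages of rows, resuming after the biggest id read so far."""
    last_id = 0  # assumption: ids are positive integers
    while True:
        params = {"last_id": last_id, "page_size": page_size}
        page = conn.execute(PAGE_QUERY, params).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]  # rows are ordered by id, so this is the max


if __name__ == "__main__":
    for page in iter_pages(make_db(), 2):
        print([row[0] for row in page])
```

The filter only restricts the rows being returned, not the older subquery, so a group's winner is still decided against all rows of that group.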

Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by either a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.

If these little differences make it impossible to apply them, then you must find a workaround. For the variable this can be a one-row-table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know then that you only need the next n rows and cannot optimize the execution plan accordingly.
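Both workarounds can be sketched together. This is only an illustration under assumptions: Python's built-in sqlite3 module is the stand-in engine, page_state is a hypothetical one-row table holding the last read ID, and the query is simplified to a plain select (the same filter would go into each branch of the union query).

```python
import sqlite3


def make_db():
    """Create a small sample table plus the one-row state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table mytable (id integer primary key, name text)")
    conn.executemany(
        "insert into mytable values (?, ?)",
        [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")],
    )
    # Hypothetical one-row table that replaces the :last_biggest_id variable.
    conn.execute("create table page_state (last_id integer not null)")
    conn.execute("insert into page_state values (0)")
    return conn


# No variable and no fetch/limit clause: the state lives in page_state.
QUERY = """
select id, name
from mytable
where id > (select last_id from page_state)
order by id
"""


def next_page(conn, n):
    """Read the next n rows, stop fetching, and remember the last id."""
    cur = conn.execute(QUERY)
    page = cur.fetchmany(n)  # fetch only as many rows as needed
    cur.close()
    if page:
        conn.execute("update page_state set last_id = ?", (page[-1][0],))
    return page


if __name__ == "__main__":
    conn = make_db()
    page = next_page(conn, 2)
    while page:
        print(page)
        page = next_page(conn, 2)
```

A single shared state row only suits one reader at a time; with several concurrent clients the state would have to be kept per client.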

And then there is the option not to do the pagination in the DBMS, but read the complete result into your app and handle pagination there (which then becomes a mere display thing and allocates a lot of memory of course).