Performance of ISNULL() in GROUP BY clause SQL-CodePudding

I have been refactoring some old queries recently and noticed that a lot of them repeat ISNULL() in the GROUP BY clause, where it is used in the SELECT clause. I feel in my bones that removing the ISNULL() in the GROUP BY clause will improve performance, but I can't find any documentation on whether it is actually likely to or not. Here's the sort of thing I mean:

SELECT
     ISNULL(Foo,-1) AS Foo
    ,ISNULL(Bar,-1) AS Bar
    ,SUM(This)      AS This
    ,SUM(That)      AS That
FROM
            dbo.ThisThatTable   AS ThisThat
LEFT JOIN   dbo.FooBarTable     AS FooBar   ON ThisThat.FooBarId = FooBar.Id
GROUP BY
     ISNULL(Foo,-1)
    ,ISNULL(Bar,-1);
GO

The above is the way I keep coming across - When there is grouping on the Foo column, the SELECT and the GROUP BY for selected columns match exactly. The example below is a possible alternative - some possibly unnecessary ISNULL() calls have been removed, and the SELECT and GROUP BY clauses no longer match.

SELECT
     ISNULL(Foo,-1) AS Foo
    ,ISNULL(Bar,-1) AS Bar
    ,SUM(This)      AS This
    ,SUM(That)      AS That
FROM
            dbo.ThisThatTable   AS ThisThat
LEFT JOIN   dbo.FooBarTable     AS FooBar   ON ThisThat.FooBarId = FooBar.Id
GROUP BY
     Foo
    ,Bar;
GO

I suppose maybe when the SELECT and GROUP BY clauses match, the optimiser only has to do the ISNULL() calculation once to know what is going on, so it might be theoretically more performative to group by the results that are actually selected? Alternatively, maybe it is better to avoid adding a second set of ISNULL() calls that don't change the granularity of the data at all... Maybe the optimiser is clever enough to realise that the NULLS in the grouping are (in this case) -1s in the selection...?

I personally would prefer removing any unnecessary functions, especially once that might affect index usage but when I look online, the references to performance are all like the answers here, about using ISNULL() in the WHERE clause, which I already know to avoid.

I also suspect that any gains are going to be vanishingly small, so this is really asking for an academic or theoretical answer, but as I work, I keep wondering and it bugs me, so I thought I would ask if anyone has any thoughts.

CodePudding user response：

Non-aggregated columns in SELECT clauses generally must precisely match the ones in GROUP BY clauses. If I were you, and I were dealing with tested production code, I would not make the change you propose.

Edit the match between non-aggregated SELECT columns and GROUP BY columns is necessary for GROUP BY. If the columns in SELECT are 1:1 dependent on the columns in GROUP BY, it will work. Otherwise the results are ambiguous.

CodePudding user response：

Internally, SQL does not really have two copies of each ISNULL. They are all flattened together in the internal tree used during compilation. So, this level of optimization is not useful to consider in SQL Server. A query without any ISNULL in it would probably perform a bit faster and potentially a lot faster depending on the rest of the schema and query. However, the ISNULL in the select list and the GROUP BY list are not executed twice in most cases within SQL - this level of detail can show up in showplan, but it's often below the level of detail most people would care to examine.

CodePudding user response：

There are a few different aspects to consider here:

Referring to the same value multiple times in the same scope
In most situations, the optimizer is clever enough to collapse these into calculating them once. The fact that you have a GROUP BY over them makes this even more likely.
Is it faster to group when the value is guaranteed to not be null?
Possibly, although I doubt the difference is measurable.
The SELECT does not have to match exactly, it only needs to be functionally dependent on GROUP BY columns and aggregation functions. It may not be functionally dependent on any other columns.
The most important thing top consider: indexing.
This is much, much more important than the other considerations. When grouping, if you can hit an index then it will go much faster, because it can remove sorting and just use Stream Aggregate. This is not possible if you use ISNULL in the GROUP BY (barring computed columns or indexed views).

Note that your results will not be the same: the first example collapses the NULL group into the -1 group. The second example does not, so you may want to remove the ISNULL from the SELECT also, in order to differentiate them. Alternatively, put a WHERE ... IS NOT NULL instead.