LEFT JOIN With Redundant Predicate Performs Better Than a CROSS JOIN?-CodePudding

I'm looking at the execution plans for two of these statements and am kind of stumped on why the LEFT JOIN statement performs better than the CROSS JOIN statement:

Table Definitions:

CREATE TABLE [Employee] (
    [ID]                int             NOT NULL    IDENTITY(1,1),
    [FirstName]         varchar(40)     NOT NULL,
    CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED ([ID] ASC)
);

CREATE TABLE [dbo].[Numbers] (
    [N] INT IDENTITY (1, 1) NOT NULL,
    CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED ([N] ASC)
); --The Numbers table contains numbers 0 to 100,000.

Queries in Question where I join one 'day' to each Employee:

DECLARE @PeriodStart AS date = '2019-11-05';
DECLARE @PeriodEnd AS date = '2019-11-05';

SELECT E.FirstName, CD.ClockDate
FROM Employee E
    CROSS JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate 
                FROM Numbers N 
                WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
        ) CD
WHERE E.ID > 2000;

SELECT E.FirstName, CD.ClockDate
FROM Employee E
    LEFT JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate 
                FROM Numbers N 
                WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
        ) CD ON CD.ClockDate = CD.ClockDate
WHERE E.ID > 2000;

The Execution Plans:

As you can see, according to the optimizer the second (left join) query with the seemingly redundant predicate seems to cost way less than the first (cross join) query. This is also the case when the period dates span multiple days.

What's weird is if I change the LEFT JOIN's predicate to something different like 1 = 1 it'll perform like the CROSS APPLY. I also tried changing the SELECT portion of the LEFT JOIN to SELECT N and joined on CD.N = CD.N ... but that also seems to perform poorly.

According to the execution plan, the second query has an index seek that only reads 3000 rows from the Numbers table while the first query is reading 10 times as many. The second query's index seek also has this predicate (which I assume comes from the LEFT JOIN):

dateadd(day,[Numbers].[N] as [N].[N],[@PeriodStart])=dateadd(day,[Numbers].[N] as [N].[N],[@PeriodStart])

I would like to understand why the second query seems to perform so much better even though I wouldn't except it to? Does it have something to do with the fact I'm joining the results of the DATEADD function? Is SQL evaluating the results of DATEADD before joining?

CodePudding user response：

The reason these queries get different estimates, even though the plan is almost the same and will probably take the same time, appears to be because DATEADD(day, N.N, @PeriodStart) is nullable, therefore CD.ClockDate = CD.ClockDate essentially just verifies that the result is not null. The optimizer cannot see that it will always be non-null, so takes the row-estimate down because of it.

But it seems to me that the primary performance problem in your query is that you are selecting the whole of your numbers table every time. Instead you should just select the amount of rows you need

SELECT E.FirstName, CD.ClockDate
FROM Employee E
    CROSS JOIN (
        SELECT TOP (DATEDIFF(day, @PeriodStart, @PeriodEnd)   1)
            DATEADD(day, N.N, @PeriodStart) AS ClockDate
        FROM Numbers N
        ORDER BY N.N
    ) CD
WHERE E.ID > 2000;

Using this technique, you can even use CROSS APPLY (SELECT TOP (outerValue) if you want to correlate the amount of rows to the rest of the query.

For further tips on numbers tables, see Itzik Ben-Gan's excellent series