How does order by work when all column values are identical?-CodePudding

I use SQL Server 2016. Below is the rows in table: test_account. You can see the values of updDtm and fileCreateTime are identical. id is the primary key.

id  accno   updDtm  fileCreatedTime
-----------------------------------------------------------------------
1   123456789   2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
2   123456789   2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
3   123456789   2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000

I want to query the latest account id which accno is 123456789 order by updDtm, fileCreatedTime

I run the following SQL, the output result is id = 1

SELECT t.id 
FROM
    (SELECT 
         ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq, 
         a.id, a.accno, a.updDtm, a.fileCreatedTime
     FROM 
         test_account a) AS t
WHERE t.seq = 1

My question is does the query result is repeatable and reliable (always output id=1 either run 1 time or multiple times) when the values of columns updDtm and fileCreatedTime are identical or just output the random id?

I read some articles and learn that for MySql and Oracle the query result is not reliable and reproducible. How about SQL Server?

CodePudding user response：

The context of this documentation reference is ORDER BY usage with OFFSET and FETCH but the same considerations apply to all ORDER BY usage, including windowing functions like ROW_NUMBER(). In summary,

To achieve stable results between query requests, the following conditions must be met:

The underlying data that is used by the query must not change.
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.

I'm trying to find an case to test if the query would output result other than id=1 but with no luck

The ordering of rows when duplicate ORDER BY values exist is undefined (a.k.a. non-deterministic and arbitrary) because it depends on the execution plan (which may vary due to available indexes, stats, and the optimizer), parallelism, database engine internals, and even physical data storage. The example below yields different results due to a parallel plan on my test instance.

DROP TABLE IF EXISTS dbo.test_account;
CREATE TABLE dbo.test_account(
      id int NOT NULL
        CONSTRAINT pk_test_account PRIMARY KEY CLUSTERED
    , accno int NOT NULL
    , updDtm datetime2 NOT NULL
    , fileCreatedTime datetime2 NOT NULL
);
--insert 100K rows
WITH 
     t10 AS (SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(n))
    ,t1k AS (SELECT 0 AS n FROM t10 AS a CROSS JOIN t10 AS b CROSS JOIN t10 AS c)
    ,t1g AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS num FROM t1k AS a CROSS JOIN t1k AS b CROSS JOIN t1k AS c)
INSERT INTO dbo.test_account (id, accno, updDtm, fileCreatedTime)
SELECT num, 123456789, '2022-07-27 09:41:10.0000000', '2022-07-27 11:33:33.8300000'
FROM t1g
WHERE num <= 100000;
GO
--run query 10 times
SELECT t.id 
FROM
    (SELECT 
         ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq, 
         a.id, a.accno, a.updDtm, a.fileCreatedTime
     FROM 
         test_account a) AS t
WHERE t.seq = 1;
GO 10

Example results:

The simple solution is to add the primary key as the last column to the ORDER BY clause to break ties. This returns the same id value (1) in every iteration regardless of the execution plan and indexes.

SELECT t.id 
FROM
    (SELECT 
         ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC, a.id) AS seq, 
         a.id, a.accno, a.updDtm, a.fileCreatedTime
     FROM 
         test_account a) AS t
WHERE t.seq = 1;
GO 10

On a side note, this index will optimize the query:

CREATE NONCLUSTERED INDEX idx ON dbo.test_account(accno, updDtm DESC, fileCreatedTime DESC, id);