Home > Software design >  Is SQL floating point sum affected by the order-by clause?
Is SQL floating point sum affected by the order-by clause?

Time:10-17

Unlike the pure mathematical real numbers, or even the rational numbers, the floating-point number types are not commutative-associative. This means, as is commonly known in numerical coding, the order of a list of floating-point numbers affects the value of the floating-point sum. It can be a good idea to sort the list to put the smaller numbers first before adding up.

An SQL table does have an order. And this can be specified in an order-by clause.

Does the SQL engine sum a field in the order given in the order-by clause?

Can I force the SQL engine to add up starting with the smallest first by sorting a list of positive floating point numbers in ascending order?

Can I force the engine to add up largest first? Or is the order of summation not dependent on the order of the tables? Or perhaps the summation order is related to the order-by order in a complicated and indeterminate manner?

I recognize that this might depend on the choice of SQL engine. My core interest is in SnowFlake.

CodePudding user response:

Yes, floating point numbers are order specific, and yes, floating point numbers in Snowflake are impacted by this, they have an article somewhere talking about sort order instability and it's impact on things like ORDER-BY's, which is the opposite direction from your question.

And in theory yes, ordering a sub-select would allow for controlling for order problems. Unlike for example MS Sql Server which does not allow for ORDER-BY's on sub-select, Snowflake does allow you to do them. The real issue is do they always respect it. And I would tend to believe they don't, as most of the operations are parallelable, or which summing is one that is normally "safe", thus I would suspect you will not be able to force the order to be respected.

One idea would I had that I though would work, was to write the values to a temporary table, with the order by on that, but then you get the problem of read from that will equally likely get parallelized. You could enter image description here

versus:

CREATE OR REPLACE TABLE test_2(id INT IDENTITY(1,1) PRIMARY KEY, val FLOAT);

INSERT INTO test_2(val) VALUES (100000000000000000000000000);
INSERT INTO test_2(val) VALUES (-100000000000000000000000000);
INSERT INTO test_2(val)
SELECT 5000000
FROM TABLE(GENERATOR(ROWCOUNT=>1000));


SELECT SUM(val) AS total
FROM test_2 t;
-- 5000000000

enter image description here

To sum up: Using exact data type - NUMBER is a way to go.

  • Related