How does the CBO uses clustering key for an ORDER BY clause in snowflake?-CodePudding

Test Description

The test is conducted to find out how CBO uses the clustering feature, to run an ORDER BY query with the clustered key.

Assumptions and Pre-Conditions

According to snowflake docs:

https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html

“ Queries benefit from clustering when the queries filter or sort on the clustering key for the table. Sorting is commonly done for ORDER BY operations, for GROUP BY operations, and for some joins. ”

Conforming to it, if an ORDER BY clause is performed on the clustering key, the optimiser uses it in some way to enhance the query performance.

Test Data

Data used for this testing is the orders table from the snowflake sample_data (shared )

"SAMPLE_DATA"."TPCH_SF1000"."ORDERS"

Steps to be Executed

Created a new table with the properties of the ORDERS table in an owned database and copied all the values ;

CREATE TABLE ORDERS LIKE "SAMPLE_DATA"."TPCH_SF1000"."ORDERS";

INSERT INTO   ORDERS SELECT * FROM  "SAMPLE_DATA"."TPCH_SF1000"."ORDERS";

Removed the clustering key set on the orders table :

 ALTER TABLE ORDERS  DROP  CLUSTERING KEY;

Created a new column ORDERS_NEW (NUMBER 10,0) after converting the O_ORDERDATE (DATE) into a number.

ALTER TABLE ORDERS ADD COLUMN ORDERS_NEW NUMBER(10,0);

UPDATE  ORDERS set  ORDERS_NEW =TO_NUMBER(REPLACE(TO_VARCHAR(O_ORDERDATE),'-')) ;

Layout of the new table

Clustered the table using the ORDERS_NEW column

ALTER TABLE ORDERS  CLUSTER BY  (ORDERS_NEW) ; 

ALTER TABLE  ORDERS RESUME RECLUSTER ;

Detected the logical execution plan using EXPLAIN on an query with ORDER BY clause on clustering key ( in this case : ORDERS_NEW)

EXPLAIN SELECT  * FROM   ORDERS ORDER BY ORDERS_NEW ;

Expected Result

Since after using the clustering key on the table it is supposed to be pre-sorted, the expectation would be that it does not need to be sorted on an execution.

Even if the sorting is in place with the execution plan it should be using the clustering feature, as defined in the documentation in some way to enhance the query performance.

Actual Result

The logical execution plan uses the SORT operation on the clustered table which makes the clustering feature redundant for the ORDER BY clause.

RESULT OF EXPLAIN QUERY

Conclusion and Discussions

In comparison with the expected result to the actual result, it fails to meet the functioning stated by the snowflake.

By this, I would extend our discussion to Why didn't Snowflake use the cluster key? And in what circumstances would it have done so?

CodePudding user response：

Why didn't Snowflake use the cluster key?

Because the query optimiser determined that it didn't need to

And in what circumstances would it have done so?

when the query optimiser determines that using the cluster key would be beneficial

CodePudding user response：

Since after using the clustering key on the table it is supposed to be pre-sorted...

Reclustering is not the same as sorting. Inserting sorted rows is one way to force reclustering, but the two are not the same.

...the expectation would be that it does not need to be sorted on an execution.

Snowflake does not store data internally this way for tables. If you run a query with an ORDER BY applied to it, then the results in the result set cache will be stored in sorted order. However, that's not how tables backed by micropartitions are stored.

...Why didn't Snowflake use the cluster key?

Snowflake does use the clustering key in a number of different ways. The most common way is the auto reclustering service runs transparently in the background to recluster rows in the micropartitions. The goal is to cost-effectively minimize the differences between the MIN and MAX value for the clustering key in each micropartition. In this case it's also applied using a sort in the DML. Also on some DML operations it will attempt to preserve clustering on the rows resulting from the DML.

Currently the optimizer does not use (or need to use) the clustering key. Rather, the compaction of the MIN/MAX range on the micropartitions is what will improve performance. The optimizer will push down all eligible filters to the table scan for partition pruning. Reclustering can make that more effective, but the optimizer will do it whether reclustering has been done or not.