Do the following two joins operate in the same way in AWS Redshift in terms of performance?
Join 1
Select col_a, col_b
from table_1
inner join table_2 on table_1.key1 = table_2.key2
where table_2.col_b = 'Y';
Join 2
Select col_a, col_b
from table_1
inner join table_2 on table_1.key1 = table_2.key2 and table_2.col_b = 'Y';
CodePudding user response:
They should and for a query this simple I hope so. However, Redshift uses several query optimizers depending on the situation and each has its own strengths so in some (complex) circumstances I can see things diverging.
The biggest performance gain is for "table_2.col_b = 'Y'" to be used as an "range restricting scan" criteria. This is when Redshift uses a WHERE statement to make block metadata comparisons to limit the amount of table data scanned. This simple comparison of a column to a constant in the ON clause SHOULD be recognized by the optimizer as a WHERE clause and used for restricted scanning. You can look for this in the STL_SCAN system table for the query as there is column called "is_rrscan" that indicates that a restricted scan was used. The table metadata needs to be up to date as well so make sure ANALYZE has been run.
In general it is a good idea to put joining clauses in the ON clause (not in the WHERE) and WHERE clauses not in the ON. This improves readability and stops you from being at the mercy of the optimizer's whims.