Home > Blockchain >  Snowflake query engine strategy on several with query conditions
Snowflake query engine strategy on several with query conditions

Time:10-21

I am doing a migration job from pyspark query into snowflake query and wondering which option is better between below A, B options.

To avoid unnecessary query, I would like to go B option if there are not that significant performance difference.

In B option, does snowflake query engine automatically optimize and internally behaves like the A option?

A option

With A1 AS (select * from a1 where date='2021-10-20'),
A2 AS (select * from a2 where date='2021-10-20'),
A3 AS (select * from a3 where date='2021-10-20'),
A4 AS (select * from a4 where date='2021-10-20'),
A5 AS (select * from a5 where date='2021-10-20')
SELECT *
FROM final_merged_table

and B option

With A1 AS (select * from a1),
A2 AS (select * from a2),
A3 AS (select * from a3),
A4 AS (select * from a4),
A5 AS (select * from a5)
SELECT *
FROM final_merged_table
WHERE date = '2021-10-20'

CodePudding user response:

We can test this. First, let's construct a table with a week of dates and several million rows:

create or replace table one_week2
as
select '2020-04-01'::date   (7*seq8()/100000000)::int day, random() data, random() data2, random() data3
from table(generator(rowcount => 100000000))

Now we can write both queries to go over this table:

Option 1:

With A1 AS (select * from one_week2 where day='2020-04-05'),
A2 AS (select * from one_week2 where day='2020-04-05'),
A3 AS (select * from one_week2 where day='2020-04-05'),
A4 AS (select * from one_week2 where day='2020-04-05'),
A5 AS (select * from one_week2 where day='2020-04-05'),
final_merged_table as (
    select * from a1 
    union all select * from a2
    union all select * from a3
    union all select * from a4
    union all select * from a5)

SELECT count(*)
FROM final_merged_table

Option 2:

With A1 AS (select * from one_week2),
A2 AS (select * from one_week2),
A3 AS (select * from one_week2),
A4 AS (select * from one_week2),
A5 AS (select * from one_week2),
final_merged_table as (
    select * from a1 
    union all select * from a2
    union all select * from a3
    union all select * from a4
    union all select * from a5)

SELECT count(*)
FROM final_merged_table
where day='2020-04-05'
;

When we run these queries, the profile for both looks identical - as the filter has been pushed down:

Option 1 profile

enter image description here

Option 2 profile

enter image description here

In summary

You can trust the Snowflake optimizer.

Trust is important, but also verify: Sometimes the optimizer can get confused by convoluted CTEs. Sometimes the Snowflake engs optimize the optimizer, and what doesn't work today, can work better tomorrow.

enter image description here

  • Related