Snowflake: most efficient way to join two tables that use the WITH clause instead of sub-queries-CodePudding

Here's the code I've written. Just wondering if there's any way to make it more efficient. I've self joining two separate tables to each other and then, I want to join the result of both to eachother:

SELECT 
T1.CITY,
T1.TIMESTAMP,
T1.VALUE AS PARTICULATE_LEVEL,
T2.VALUE AS POLLUTION_LEVEL
FROM
(

WITH PARTICULATE_DATA (CITY,TIMESTAMP, VALUE) AS
(
SELECT LOCATION,
       TIMESTAMP,
      (MICRO_VALUE * 0.097) AS VALUE
  
FROM 
CONSOLIDATED_TABLE
)

select a.*,
    max(b.VALUE) as MAX_PAR_PREVIOUS_24_HOURS,
from PARTICULATE_DATA as a
left join PARTICULATE_DATA as b
    on a.CITY = b.CITY and b.TIMESTAMP between dateadd(day, -1, a.TIMESTAMP) and a.TIMESTAMP
group by 1,2,3,
order by 1,2

) T1
INNER JOIN 
(

WITH POLLUTION_DATA (CITY,TIMESTAMP, VALUE) AS
(
SELECT LOCATION,
       TIMESTAMP,
      (VALUE ) AS VALUE
  
FROM 
CONSOLIDATED_TABLE_2
)

select a.*,
    max(b.VALUE) as MAX_POLLUTION_LEVEL_PREVIOUS_24_HOURS,
from POLLUTION_DATA as c
left join POLLUTION_DATA as d
    on a.CITY = b.CITY and b.TIMESTAMP between dateadd(day, -1, a.TIMESTAMP) and a.TIMESTAMP
group by 1,2,3,
order by 1,2

) T2
ON T1.CITY = T2.CITY

CodePudding user response：

Well there are many things that are happening the are waste of time:

The two ORDER BY's that should not be present. And if you must self join the DATEADD should be moved into the CTE, to improve that section:

    WITH particulate_data (city, timestamp, value) AS (
        SELECT 
            location,
            timestamp,
            dateadd(day, -1, timestamp) AS ts_m1day,
            micro_value * 0.097 AS value
        FROM consolidated_table
    )
    select 
        a.location,
        a.timestamp,
        a.value,
        max(b.value) AS max_par_previous_24_hours
    FROM particulate_data AS a
    LEFT JOIN particulate_data AS b
        ON a.city = b.city 
            AND b.timestamp BETWEEN a.ts_m1day AND a.TIMESTAMP
    GROUP BY 1,2,3

Then on the join between T1 and T2 you are only joining on CITY, but you will get T2 dates for EVERY T1 timestamp. Now it might be clear that these two data sources might not overlap.

But if you rolled the data up (truncated) the timestamp to the day/date then you can aggregate those values in the MAX and thus avoid the self join, that have a much simpler window, of 'max value in the this "day"' verse the last 24h of each record.

How I would write this SQL:

WITH particulate_data AS (
    SELECT 
        location as city,
        timestamp::date as day,
        max(micro_value * 0.097) AS max_par
    FROM consolidated_table
    GROUP BY 1,2
), pollution_data AS (
    SELECT 
        location as city,
        timestamp::date as day,
        max(VALUE) as max_pollution_level
    FROM consolidated_table_2
    GROUP BY 1,2
)
SELECT 
    t1.city,
    t1.day,
    t1.value AS particulate_level,
    t2.value AS pollution_level
FROM particulate_data AS t1
JOIN pollution_data AS T2
    ON t1.city = t2.city AND t1.day = t2.day