Here's the code I've written. Just wondering if there's any way to make it more efficient. I've self joining two separate tables to each other and then, I want to join the result of both to eachother:
SELECT
T1.CITY,
T1.TIMESTAMP,
T1.VALUE AS PARTICULATE_LEVEL,
T2.VALUE AS POLLUTION_LEVEL
FROM
(
WITH PARTICULATE_DATA (CITY,TIMESTAMP, VALUE) AS
(
SELECT LOCATION,
TIMESTAMP,
(MICRO_VALUE * 0.097) AS VALUE
FROM
CONSOLIDATED_TABLE
)
select a.*,
max(b.VALUE) as MAX_PAR_PREVIOUS_24_HOURS,
from PARTICULATE_DATA as a
left join PARTICULATE_DATA as b
on a.CITY = b.CITY and b.TIMESTAMP between dateadd(day, -1, a.TIMESTAMP) and a.TIMESTAMP
group by 1,2,3,
order by 1,2
) T1
INNER JOIN
(
WITH POLLUTION_DATA (CITY,TIMESTAMP, VALUE) AS
(
SELECT LOCATION,
TIMESTAMP,
(VALUE ) AS VALUE
FROM
CONSOLIDATED_TABLE_2
)
select a.*,
max(b.VALUE) as MAX_POLLUTION_LEVEL_PREVIOUS_24_HOURS,
from POLLUTION_DATA as c
left join POLLUTION_DATA as d
on a.CITY = b.CITY and b.TIMESTAMP between dateadd(day, -1, a.TIMESTAMP) and a.TIMESTAMP
group by 1,2,3,
order by 1,2
) T2
ON T1.CITY = T2.CITY
CodePudding user response:
Well there are many things that are happening the are waste of time:
The two ORDER BY's that should not be present. And if you must self join the DATEADD should be moved into the CTE, to improve that section:
WITH particulate_data (city, timestamp, value) AS (
SELECT
location,
timestamp,
dateadd(day, -1, timestamp) AS ts_m1day,
micro_value * 0.097 AS value
FROM consolidated_table
)
select
a.location,
a.timestamp,
a.value,
max(b.value) AS max_par_previous_24_hours
FROM particulate_data AS a
LEFT JOIN particulate_data AS b
ON a.city = b.city
AND b.timestamp BETWEEN a.ts_m1day AND a.TIMESTAMP
GROUP BY 1,2,3
Then on the join between T1
and T2
you are only joining on CITY, but you will get T2 dates for EVERY T1 timestamp. Now it might be clear that these two data sources might not overlap.
But if you rolled the data up (truncated) the timestamp to the day/date then you can aggregate those values in the MAX and thus avoid the self join, that have a much simpler window, of 'max value in the this "day"' verse the last 24h of each record.
How I would write this SQL:
WITH particulate_data AS (
SELECT
location as city,
timestamp::date as day,
max(micro_value * 0.097) AS max_par
FROM consolidated_table
GROUP BY 1,2
), pollution_data AS (
SELECT
location as city,
timestamp::date as day,
max(VALUE) as max_pollution_level
FROM consolidated_table_2
GROUP BY 1,2
)
SELECT
t1.city,
t1.day,
t1.value AS particulate_level,
t2.value AS pollution_level
FROM particulate_data AS t1
JOIN pollution_data AS T2
ON t1.city = t2.city AND t1.day = t2.day