so I have a table that's sort of like this:
DELIVERY_AREA_ID | DELIVERY_RADIUS_METERS | EVENT_STARTED_TIMESTAMP |
---|---|---|
234sfd | 4000 | 2020-01-01 12:19:29.719 |
234sfd | 6500 | 2020-01-01 12:31:40.325 |
234sfd | 3500 | 2020-01-01 12:53:10.538 |
234sfd | 6500 | 2020-01-01 13:11:36.094 |
234sfd | 3500 | 2020-01-01 13:32:26.754 |
234sfd | 6500 | 2020-01-01 13:59:11.104 |
234sfd | 6500 | 2020-01-02 07:44:16.792 |
234sfd | 3500 | 2020-01-02 08:07:36.284 |
234sfd | 6500 | 2020-01-02 08:54:08.014 |
234sfd | 3500 | 2020-01-02 09:53:05.853 |
234sfd | 6500 | 2020-01-02 10:04:39.443 |
234sfd | 10000 | 2020-07-01 08:29:20.194 |
234sfd | 3500 | 2020-07-03 07:50:41.782 |
234sfd | 10000 | 2020-07-03 08:33:14.695 |
234sfd | 3500 | 2020-07-05 07:47:05.539 |
234sfd | 10000 | 2020-07-05 07:53:13.930 |
234sfd | 3500 | 2020-07-05 09:18:57.688 |
234sfd | 10000 | 2020-07-05 09:51:07.547 |
234sfd | 3500 | 2020-07-19 18:02:14.099 |
the data is actually much more varied but yeah it follows that format.
I am trying to, in one query, in snowflake database, make a get the top ranked radius by duration. I currently have this:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
I want to get the first rank for each month_year but when I use
where RADIUS_RANK = 1
I get an error: Syntax error: unexpected 'where'. (line 21)
Im not sure how to resolve this
I have tried this link which appears to have the same question but the solution is already what I am trying.
CodePudding user response:
It is not possible to solve this scenario without querying the output of your query, in other words, using the output of that query as an input for another top-level query.
- You can not use a field produced at the projection level in the WHERE clause
- You can not use analytic functions in the WHERE clause
- You can not use analytic functions in a HAVING clause
So the only solution is to query the output of that query and retrieve only the MIN rank.
CodePudding user response:
To filter windowed function at the same query level you need to use QUALIFY clause:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
QUALIFY RADIUS_RANK = 1;
If the rank column is not required then the entire expression could be moved:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
QUALIFY RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) = 1;