I am using AWS Athena (Presto based) and I have this table named base
:
id | category | year | month |
---|---|---|---|
1 | a | 2021 | 6 |
1 | b | 2022 | 8 |
1 | a | 2022 | 11 |
2 | a | 2022 | 1 |
2 | a | 2022 | 4 |
2 | b | 2022 | 6 |
I would like to craft a query that counts the distinct values of the categories per id, cumulatively per month and year, but retaining the original columns:
id | category | year | month | sumC |
---|---|---|---|---|
1 | a | 2021 | 6 | 1 |
1 | b | 2022 | 8 | 2 |
1 | a | 2022 | 11 | 2 |
2 | a | 2022 | 1 | 1 |
2 | a | 2022 | 4 | 1 |
2 | b | 2022 | 6 | 2 |
I've tried doing the following query with no success:
SELECT id,
category,
year,
month,
COUNT(category) OVER (PARTITION BY id, ORDER BY year, month) AS sumC FROM base;
This results in 1, 2, 3, 1, 2, 3
which is not what I'm looking for. I'd rather need something like a COUNT(DISTINCT)
inside a window function, though it's not supported as a construct.
I also tried the DENSE_RANK
trick:
DENSE_RANK() OVER (PARTITION BY id ORDER BY category)
DENSE_RANK() OVER (PARTITION BY id ORDER BY category)
- 1 as sumC
Though, because there is no ordering between year
and month
, it just results in 2, 2, 2, 2, 2, 2
.
Any help is appreciated!
CodePudding user response:
One option is
- creating a new column that will contain when each "category" is seen for the first time (partitioning on "id", "category" and ordering on "year", "month")
- computing a running sum over this column, with the same partition
WITH cte AS (
SELECT *,
CASE WHEN ROW_NUMBER() OVER(
PARTITION BY id, category
ORDER BY year, month) = 1
THEN 1
ELSE 0
END AS rn1
FROM base
ORDER BY id,
year_,
month_
)
SELECT id,
category,
year_,
month_,
SUM(rn1) OVER(
PARTITION BY id
ORDER BY year, month
) AS sumC
FROM cte
Does it work for you?