I have my table like this:
event_date | user_id |
---|---|
2021-10-01 | 1 |
2021-10-01 | 2 |
2021-10-02 | 1 |
2021-10-02 | 2 |
2021-10-02 | 3 |
2021-10-03 | 3 |
2021-10-03 | 4 |
2021-10-03 | 4 |
2021-10-04 | 5 |
2021-10-04 | 2 |
2021-10-06 | 3 |
2021-10-06 | 5 |
Sample data
WITH my_table (event_date, user_id) AS (
values ('2021-10-01', 1),
('2021-10-01', 2),
('2021-10-02', 1),
('2021-10-02', 2),
('2021-10-02', 3),
('2021-10-03', 3),
('2021-10-03', 4),
('2021-10-03', 4),
('2021-10-04', 5),
('2021-10-04', 2),
('2021-10-06', 3),
('2021-10-06', 5)
)
SELECT *
FROM my_table
And I want to calculate moving count distinct user_id for the last 3 days. Desired table should look like this:
event_date | count_distinct_user_id_last_3_days |
---|---|
2021-10-01 | 2 |
2021-10-02 | 3 |
2021-10-03 | 4 |
2021-10-04 | 5 |
2021-10-05 | 4 |
2021-10-06 | 4 |
Thanks for your help!
CodePudding user response:
You can get an approximate moving count distinct via the following query below. Unfortunately, DISTINCT
is not currently supported in window functions, so you can't use COUNT(DISTINCT user_id) OVER (...)
WITH my_table (event_date, user_id) AS (
values ('2021-10-01', 1),
('2021-10-01', 2),
('2021-10-02', 1),
('2021-10-02', 2),
('2021-10-02', 3),
('2021-10-03', 3),
('2021-10-03', 4),
('2021-10-03', 4),
('2021-10-04', 5),
('2021-10-04', 2),
('2021-10-06', 3),
('2021-10-06', 5)
)
SELECT
event_date,
approx_distinct(user_id) OVER (
ORDER BY cast(event_date AS date)
RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND CURRENT ROW)
FROM my_table
CodePudding user response:
Since you are missing some days in the data (and you want to show all dates) and Athena version of Presto does not support date ranges you can try approach inserting missing dates with window functions based on rows and leveraging array functions:
WITH my_table (event_date, user_id) AS (
values ('2021-10-01', 1),
('2021-10-01', 2),
('2021-10-02', 1),
('2021-10-02', 2),
('2021-10-02', 3),
('2021-10-03', 3),
('2021-10-03', 4),
('2021-10-03', 4),
('2021-10-04', 5),
('2021-10-04', 2),
('2021-10-06', 3),
('2021-10-06', 5)
),
my_table_dates as (
SELECT date(event_date) event_date, user_id
from my_table
),
grouped as(
SELECT event_date,
array_distinct(array_agg(user_id)) users
FROM my_table_dates
group by event_date
),
with_dates_ranges as (
select *,
sequence(lag(event_date) over(order by event_date) interval '1' day, event_date, interval '1' day) dates
from grouped
),
with_gaps_filled as(
select d event_date,
if(d = event_date, users, array[]) users
from with_dates_ranges,
unnest (coalesce(dates, array[event_date])) t(d) -- coalesce for the first day
)
-- final query
select date(event_date) event_date,
cardinality(array_distinct(
flatten(array_agg(users) over(order by event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW))
)) count_distinct_user_id_last_3_days
from with_gaps_filled
Output:
event_date | count_distinct_user_id_last_3_days |
---|---|
2021-10-01 | 2 |
2021-10-02 | 3 |
2021-10-03 | 4 |
2021-10-04 | 5 |
2021-10-05 | 4 |
2021-10-06 | 3 |