Home > Software engineering >  Rolling distinct count for last 3 days (Presto DB)
Rolling distinct count for last 3 days (Presto DB)

Time:07-28

I have my table like this:

event_date user_id
2021-10-01 1
2021-10-01 2
2021-10-02 1
2021-10-02 2
2021-10-02 3
2021-10-03 3
2021-10-03 4
2021-10-03 4
2021-10-04 5
2021-10-04 2
2021-10-06 3
2021-10-06 5

Sample data

WITH my_table (event_date, user_id) AS (
    values ('2021-10-01', 1),
        ('2021-10-01', 2),
        ('2021-10-02', 1),
        ('2021-10-02', 2),
        ('2021-10-02', 3),
        ('2021-10-03', 3),
        ('2021-10-03', 4),
        ('2021-10-03', 4),
        ('2021-10-04', 5),
        ('2021-10-04', 2),
        ('2021-10-06', 3),
        ('2021-10-06', 5)
) 
SELECT *
FROM my_table

And I want to calculate moving count distinct user_id for the last 3 days. Desired table should look like this:

event_date count_distinct_user_id_last_3_days
2021-10-01 2
2021-10-02 3
2021-10-03 4
2021-10-04 5
2021-10-05 4
2021-10-06 4

Thanks for your help!

CodePudding user response:

You can get an approximate moving count distinct via the following query below. Unfortunately, DISTINCT is not currently supported in window functions, so you can't use COUNT(DISTINCT user_id) OVER (...)

WITH my_table (event_date, user_id) AS (
    values ('2021-10-01', 1),
        ('2021-10-01', 2),
        ('2021-10-02', 1),
        ('2021-10-02', 2),
        ('2021-10-02', 3),
        ('2021-10-03', 3),
        ('2021-10-03', 4),
        ('2021-10-03', 4),
        ('2021-10-04', 5),
        ('2021-10-04', 2),
        ('2021-10-06', 3),
        ('2021-10-06', 5)
)
SELECT 
    event_date, 
    approx_distinct(user_id) OVER (
        ORDER BY cast(event_date AS date) 
        RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND CURRENT ROW)
FROM my_table

CodePudding user response:

Since you are missing some days in the data (and you want to show all dates) and Athena version of Presto does not support date ranges you can try approach inserting missing dates with window functions based on rows and leveraging array functions:

WITH my_table (event_date, user_id) AS (
    values ('2021-10-01', 1),
        ('2021-10-01', 2),
        ('2021-10-02', 1),
        ('2021-10-02', 2),
        ('2021-10-02', 3),
        ('2021-10-03', 3),
        ('2021-10-03', 4),
        ('2021-10-03', 4),
        ('2021-10-04', 5),
        ('2021-10-04', 2),
        ('2021-10-06', 3),
        ('2021-10-06', 5)
),
my_table_dates as (
    SELECT date(event_date) event_date, user_id
    from my_table
),
grouped as(
    SELECT event_date, 
        array_distinct(array_agg(user_id)) users
    FROM my_table_dates
    group by event_date
),
with_dates_ranges as (
    select *, 
        sequence(lag(event_date) over(order by event_date)   interval '1' day, event_date, interval '1' day) dates
    from grouped
),
with_gaps_filled as(
    select d event_date,
        if(d = event_date, users, array[]) users
    from with_dates_ranges,
    unnest (coalesce(dates, array[event_date])) t(d) -- coalesce for the first day
)

-- final query
select date(event_date) event_date,
    cardinality(array_distinct(
        flatten(array_agg(users) over(order by event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW))
    )) count_distinct_user_id_last_3_days
from with_gaps_filled

Output:

event_date count_distinct_user_id_last_3_days
2021-10-01 2
2021-10-02 3
2021-10-03 4
2021-10-04 5
2021-10-05 4
2021-10-06 3
  • Related