Rolling Daily Distinct Counts, Partitioned by Month

I have the following table:

CREATE TABLE tbl (
  id      int NOT NULL
, date    date NOT NULL
, cid     int NOT NULL
);
 
INSERT INTO tbl VALUES
  (1 , '2022-01-01', 1)
, (2 , '2022-01-01', 1)
, (3 , '2022-01-01', 2)
, (4 , '2022-01-01', 3)
, (5 , '2022-01-02', 1)
, (6 , '2022-01-02', 4)
, (7 , '2022-01-03', 5)
, (8 , '2022-01-03', 6)
, (9 , '2022-02-01', 1)
, (10, '2022-02-01', 5)
, (11, '2022-02-02', 5)
, (12, '2022-02-02', 3)
;

I'm trying to count distinct users (= cid) for each day, but as a rolling count within the month. E.g., for 2022-01-01, only distinct users with date = 2022-01-01 are counted. For 2022-01-02, distinct users with date between 2022-01-01 and 2022-01-02 are counted, and so on. The count should restart each month.

My desired output:

   date        distinct_cids
2022-01-01           3
2022-01-02           4
2022-01-03           6
2022-02-01           2
2022-02-02           3

CodePudding user response:

I don't have access to Snowflake, so I can't guarantee that this will work, but from the sound of it:

select date, count(distinct cid) over (partition by month(date) order by date) 
from tbl
order by date;

If you have several years' worth of data, you can partition by year and month:

select date, count(distinct cid) over (partition by year(date), month(date) order by date) 
from tbl
order by date;

DATE is a reserved word in standard SQL, so you may want to consider renaming your column.

EDIT: Since DISTINCT is not allowed in that window function, you can try a vanilla SQL variant. It is likely to be slow for a large table:

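select dt, count(cid) as distinct_cids
from (
    -- one row per (date, user seen on or before that date in the same month)
    select distinct dt.dt, x.cid
    from tbl x
    join (
        select distinct date as dt from tbl
    ) dt (dt)
        on x.date <= dt.dt
        and month(x.date) = month(dt.dt)
        and year(x.date) = year(dt.dt)  -- keeps the same month of different years apart
) t
group by dt
order by dt
;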

The idea is that we create a new relation (t) with distinct users with a date less than or equal to the current date in the current month. Then we can just count those users for each date.
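
To make the intermediate relation concrete, here is a rough sketch that runs only the inner query for a single date from the sample data. The filter on 2022-01-02 and the alias d are illustrative assumptions, as is the year check (added in case the table spans several years). Like the queries above, this is untested on Snowflake.

```sql
-- Inspect relation t for one date: every distinct user seen
-- on or before 2022-01-02 within the same month.
select distinct d.dt, x.cid
from tbl x
join (
    select distinct date as dt from tbl
) d
    on x.date <= d.dt
    and month(x.date) = month(d.dt)
    and year(x.date) = year(d.dt)   -- assumption: guards against multi-year data
where d.dt = '2022-01-02'           -- illustrative filter, not part of the answer
order by x.cid;

-- With the sample rows this should give four rows:
--   2022-01-02, 1
--   2022-01-02, 2
--   2022-01-02, 3
--   2022-01-02, 4
```

Counting these rows gives 4, which matches the desired output for 2022-01-02. The outer query does the same count for every date at once.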