I have the following table:
CREATE TABLE tbl (
id int NOT NULL
, date date NOT NULL
, cid int NOT NULL
);
INSERT INTO tbl VALUES
(1 , '2022-01-01', 1)
, (2 , '2022-01-01', 1)
, (3 , '2022-01-01', 2)
, (4 , '2022-01-01', 3)
, (5 , '2022-01-02', 1)
, (6 , '2022-01-02', 4)
, (7 , '2022-01-03', 5)
, (8 , '2022-01-03', 6)
, (9 , '2022-02-01', 1)
, (10, '2022-02-01', 5)
, (11, '2022-02-02', 5)
, (12, '2022-02-02', 3)
;
I'm trying to count distinct users (= cid) each day, but the result is rolling during the month. E.g., for 2022-01-01, only distinct users with date = 2022-01-01 are counted. For 2022-01-02, distinct users with date between 2022-01-01 and 2022-01-02 are counted, and so on. The count should restart each month.
My desired output:
date distinct_cids
2022-01-01 3
2022-01-02 4
2022-01-03 6
2022-02-01 2
2022-02-02 3
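To pin down the rule described above, here is a minimal Python sketch that computes the rolling distinct count directly from the sample rows. The function name `rolling_distinct_by_month` and the `ROWS` list are hypothetical helpers introduced for illustration only; they are not part of the table or of any Snowflake API.

```python
from collections import defaultdict

# Sample rows copied from the INSERT statement above: (id, date, cid).
ROWS = [
    (1, "2022-01-01", 1), (2, "2022-01-01", 1), (3, "2022-01-01", 2),
    (4, "2022-01-01", 3), (5, "2022-01-02", 1), (6, "2022-01-02", 4),
    (7, "2022-01-03", 5), (8, "2022-01-03", 6), (9, "2022-02-01", 1),
    (10, "2022-02-01", 5), (11, "2022-02-02", 5), (12, "2022-02-02", 3),
]


def rolling_distinct_by_month(rows):
    """Return [(date, distinct_cids)], with the count restarting each month."""
    seen = defaultdict(set)  # "YYYY-MM" -> cids seen so far in that month
    result = {}
    # ISO date strings sort chronologically, so a plain string sort is enough.
    for _id, day, cid in sorted(rows, key=lambda r: r[1]):
        month = day[:7]
        seen[month].add(cid)
        # The last row processed for a day leaves that day's final count.
        result[day] = len(seen[month])
    return sorted(result.items())


for day, n in rolling_distinct_by_month(ROWS):
    print(day, n)
```

On the sample data this reproduces the desired output listed above.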
CodePudding user response:
I don't have access to Snowflake, so I can't guarantee that this will work, but from the sound of it:
select date, count(distinct cid) over (partition by month(date) order by date)
from tbl
order by date;
If you have several years' worth of data, you can partition by year and month:
select date, count(distinct cid) over (partition by year(date), month(date) order by date)
from tbl
order by date;
DATE is a reserved word in many SQL dialects, so you may want to consider renaming your column.
EDIT: Since distinct is disallowed in that window function, you can try a vanilla SQL variant. It is likely to be slow for a large table:
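select dt, count(cid)
from (
    -- one row per (date, user seen on or before that date in the same month)
    select distinct dt.dt, x.cid
    from tbl x
    join (
        select distinct date as dt from tbl
    ) dt (dt)
        on x.date <= dt.dt
        and year(x.date) = year(dt.dt)   -- keeps months of different years apart
        and month(x.date) = month(dt.dt)
) t
group by dt
order by dt
;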
The idea is that we create a new relation (t) with distinct users with a date less than or equal to the current date in the current month. Then we can just count those users for each date.
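As a rough way to check that idea without a Snowflake account, the sketch below runs the same self-join in an in-memory SQLite database from Python. Two things are assumptions on my part: SQLite has no `month()` function, so the month match is written with `strftime('%Y-%m', ...)` (which also keeps different years apart), and the derived table is aliased `d` instead of `dt`. The Snowflake query itself remains untested.

```python
import sqlite3

# Build the sample table from the question in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id int NOT NULL, date date NOT NULL, cid int NOT NULL)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "2022-01-01", 1), (2, "2022-01-01", 1), (3, "2022-01-01", 2),
        (4, "2022-01-01", 3), (5, "2022-01-02", 1), (6, "2022-01-02", 4),
        (7, "2022-01-03", 5), (8, "2022-01-03", 6), (9, "2022-02-01", 1),
        (10, "2022-02-01", 5), (11, "2022-02-02", 5), (12, "2022-02-02", 3),
    ],
)

# Same shape as the vanilla SQL variant: pair every date with the users seen
# on or before it in the same month, de-duplicate, then count per date.
# strftime('%Y-%m', ...) is the SQLite stand-in for the month() comparison.
QUERY = """
SELECT dt, COUNT(cid) AS distinct_cids
FROM (
    SELECT DISTINCT d.dt AS dt, x.cid AS cid
    FROM tbl x
    JOIN (SELECT DISTINCT date AS dt FROM tbl) d
      ON x.date <= d.dt
     AND strftime('%Y-%m', x.date) = strftime('%Y-%m', d.dt)
) t
GROUP BY dt
ORDER BY dt
"""

result = conn.execute(QUERY).fetchall()
for dt, distinct_cids in result:
    print(dt, distinct_cids)
```

On the sample data this returns the five rows from the desired output. The join compares every row against every date in its month, so the cost grows quickly with table size, which is why the query is likely slow on a large table.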