Home > Net >  Rolling COUNT DISTINCT of n-day active users using T-SQL
Rolling COUNT DISTINCT of n-day active users using T-SQL

Time:05-13

I am counting 7-day active users using T-SQL. I used the following code:

SELECT 
    *, 
    COUNT(DISTINCT [UserID]) OVER (
        PARTITION BY [HospitalID], [HospitalName], [Device]
        ORDER BY [Date]
        ROWS 7 PRECEDING
    ) AS [7-Day Active Users]
FROM UserActivity
ORDER BY [HospitalID], [HospitalName], [Device], [Date]

I was told Use of DISTINCT is not allowed with the OVER clause. UserActivity is a table with columns HospitalID, HospitalName, Device (either phone or tablet), Date and UserID (could be NULL). To make things easier, I have filled the gaps between dates which made Date consecutive so I can use ROWS 7 PRECEDING with confidence. I did a lot of searches online and found most solution are either using other types of SQL (which is not possible in my case) or using DENSE_RANK function which does not support a moving window. What is the correct and hopefully simpler, concise way of solving my problem?

Sample Data: https://docs.google.com/spreadsheets/d/19vrBK8ixpiPJycRjb1ekiKnEUYk5AaUH/edit?usp=sharing&ouid=110206477774349430845&rtpof=true&sd=true

CodePudding user response:

Sorry to see that COUNT DISTINCT was not supported in that type of SQL... I hadn't known that. Especially after you went to the trouble of fixing the gaps between dates!

I used Rasgo to generate the SQL -- so this won't work directly in your version (tested with Snowflake), but I think it will work as long as you fix the DATEADD function. Every RDBMS seems to do DATEADD differently, it seems.

The general concept here is to join the data upon itself using a range join condition in the WHERE clause.

Luckily, this should work for you without having to fix the gaps in the dates first.

WITH BASIC_OFFSET_7DAY AS (
  SELECT 
    A.HOSPITALNAME, 
    A.HOSPITALID, 
    A.DEVICE, 
    A.DATE, 
    COUNT(DISTINCT B.USERID) as COUNT_DISTINCT_USERID_PAST7DAY, 
    COUNT(1) AS AGG_ROW_COUNT 
  FROM 
    UserActivity A 
    INNER JOIN UserActivity B ON A.HOSPITALNAME = B.HOSPITALNAME 
    AND A.HOSPITALID = B.HOSPITALID 
    AND A.DEVICE = B.DEVICE 
  WHERE 
    B.DATE >= DATEADD(day, -7, A.DATE) 
    AND B.DATE <= A.DATE 
  GROUP BY 
    A.HOSPITALNAME, 
    A.HOSPITALID, 
    A.DEVICE, 
    A.DATE
) 
SELECT 
  src.*, 
  BASIC_OFFSET_7DAY.COUNT_DISTINCT_USERID_PAST7DAY 
FROM 
  UserActivity src 
  LEFT OUTER JOIN BASIC_OFFSET_7DAY ON BASIC_OFFSET_7DAY.DATE = src.DATE 
  AND BASIC_OFFSET_7DAY.HOSPITALNAME = src.HOSPITALNAME 
  AND BASIC_OFFSET_7DAY.HOSPITALID = src.HOSPITALID 
  AND BASIC_OFFSET_7DAY.DEVICE = src.DEVICE

Let me know how that works out and if it doesn't work I'll help you out.

Edit: For those who are trying to do this and getting stuck, a common mistake (one that I myself performed when I did this by hand) is to pay careful attention to COUNT(DISTINCT(B.col)) and not A.col. When I used Rasgo to generate the SQL to check myself, I caught my mistake. Hopefully this note helps someone in the future who makes this same mistake!

  • Related