Home > other >  Different count amount using grouping
Different count amount using grouping

Time:02-01

I want to modify this SQL query:

SELECT Count(DISTINCT contacts) AS Totaly
FROM   clicks
WHERE  clicks.campaign_id = 1234
       AND clicks.type IN ( 1, 2, 3 )  

To get clicks per day. I created this:

SELECT Cast(clicks.time AS DATE) AS 'date',
       
----- THE SAME CODE -----
Count(DISTINCT contacts) AS Totaly
FROM   clicks
WHERE  clicks.campaign_id = 1234
       AND clicks.type IN ( 1, 2, 3 )
----- THE SAME CODE -----
        
GROUP  BY Cast(clicks.time AS DATE)   

The problem is that count of records from second query is not the same as from first query. I can check that with this:

WITH cte
     AS (SELECT Cast(clicks.time AS DATE) AS 'date',
                
                ----- THE SAME CODE -----
                Count(DISTINCT contacts) AS Totaly
         FROM   clicks
         WHERE  clicks.campaign_id = 1234
                AND clicks.type IN ( 1, 2, 3 )
                ----- THE SAME CODE -----
                
         GROUP  BY Cast(clicks.time AS DATE))
SELECT Sum(totaly)
FROM   cte

So my question is - why I am not getting the same sum?

CodePudding user response:

The numbers are not comparable. COUNT(DISTINCT) counts the number of distinct different values in the group, when you change the grouping you will change the overall total because they are no longer counting the same distinct items.

For your queries your first simply COUNTs the DISTINCT values of contacts. There is no GROUP BY so this is based purely on the whole data set. For the second query you are COUNTing the number of DISTINCT values of contacts per date. This means that the same value of contacts can (and will) be counted more than once if they appear on different dates. AS a result when you SUM the COUNT from your second query you get a COUNT of distinct contacts and dates.

Let's take this very overly simplified example data:

CREATE TABLE dbo.SomeTable (ContactID int,
                            SomeDate date);
GO

INSERT INTO dbo.SomeTable (ContactID,
                           SomeDate)
VALUES(1,'20230101'),
      (1,'20230102'),
      (2,'20230102'),
      (3,'20230103'),
      (3,'20230104');

Now, very clearly, we can see that there are 3 different values for ContactId, however, we can verify this:

SELECT COUNT(DISTINCT ContactID)
FROM dbo.SomeTable;

Now let's get a DISTINCT COUNT of ContactID for each value of SomeDate. This will give us the following result:

SomeDate DistinctContacts
2023-01-01 1
2023-01-02 2
2023-01-03 1
2023-01-04 1

And to verify:

SELECT SomeDate,
       COUNT(DISTINCT ContactID) AS DistinctContacts
FROM dbo.SomeTable
GROUP BY SomeDate;

So we can clearly see that the total, 5, here is quite different but it isstill correct. The total is now representative of the DISTINCT values of contactID and SomeDate.

If you wanted to get both the DISTINCT count by SomeDate and the total you could use ROLLUP or GROUING SETS. I demonstrate both, however, due to the over simplification they return the same result; for other queries they might not (as ROLLUP might include more groups):

SELECT SomeDate,
       COUNT(DISTINCT ContactID) AS DistinctContacts
FROM dbo.SomeTable
GROUP BY SomeDate
WITH ROLLUP;

SELECT SomeDate,
       COUNT(DISTINCT ContactID) AS DistinctContacts
FROM dbo.SomeTable
GROUP BY GROUPING SETS((SomeDate),());

This returns the following:

SomeDate DistinctContacts
2023-01-01 1
2023-01-02 2
2023-01-03 1
2023-01-04 1
NULL 3

Note that for the row where SomeDate has the value NULL 3 is returned, not 5.

  • Related