I want to modify this SQL query:
SELECT Count(DISTINCT contacts) AS Totaly
FROM clicks
WHERE clicks.campaign_id = 1234
AND clicks.type IN ( 1, 2, 3 )
To get clicks per day. I created this:
SELECT Cast(clicks.time AS DATE) AS 'date',
----- THE SAME CODE -----
Count(DISTINCT contacts) AS Totaly
FROM clicks
WHERE clicks.campaign_id = 1234
AND clicks.type IN ( 1, 2, 3 )
----- THE SAME CODE -----
GROUP BY Cast(clicks.time AS DATE)
The problem is that count of records from second query is not the same as from first query. I can check that with this:
WITH cte
AS (SELECT Cast(clicks.time AS DATE) AS 'date',
----- THE SAME CODE -----
Count(DISTINCT contacts) AS Totaly
FROM clicks
WHERE clicks.campaign_id = 1234
AND clicks.type IN ( 1, 2, 3 )
----- THE SAME CODE -----
GROUP BY Cast(clicks.time AS DATE))
SELECT Sum(totaly)
FROM cte
So my question is - why I am not getting the same sum?
CodePudding user response:
The numbers are not comparable. COUNT(DISTINCT)
counts the number of distinct different values in the group, when you change the grouping you will change the overall total because they are no longer counting the same distinct items.
For your queries your first simply COUNT
s the DISTINCT
values of contacts
. There is no GROUP BY
so this is based purely on the whole data set. For the second query you are COUNT
ing the number of DISTINCT
values of contacts
per date. This means that the same value of contacts
can (and will) be counted more than once if they appear on different dates. AS a result when you SUM
the COUNT
from your second query you get a COUNT
of distinct contacts
and dates.
Let's take this very overly simplified example data:
CREATE TABLE dbo.SomeTable (ContactID int,
SomeDate date);
GO
INSERT INTO dbo.SomeTable (ContactID,
SomeDate)
VALUES(1,'20230101'),
(1,'20230102'),
(2,'20230102'),
(3,'20230103'),
(3,'20230104');
Now, very clearly, we can see that there are 3 different values for ContactId
, however, we can verify this:
SELECT COUNT(DISTINCT ContactID)
FROM dbo.SomeTable;
Now let's get a DISTINCT
COUNT
of ContactID
for each value of SomeDate
. This will give us the following result:
SomeDate | DistinctContacts |
---|---|
2023-01-01 | 1 |
2023-01-02 | 2 |
2023-01-03 | 1 |
2023-01-04 | 1 |
And to verify:
SELECT SomeDate,
COUNT(DISTINCT ContactID) AS DistinctContacts
FROM dbo.SomeTable
GROUP BY SomeDate;
So we can clearly see that the total, 5
, here is quite different but it isstill correct. The total is now representative of the DISTINCT
values of contactID
and SomeDate
.
If you wanted to get both the DISTINCT
count by SomeDate
and the total you could use ROLLUP
or GROUING SETS
. I demonstrate both, however, due to the over simplification they return the same result; for other queries they might not (as ROLLUP
might include more groups):
SELECT SomeDate,
COUNT(DISTINCT ContactID) AS DistinctContacts
FROM dbo.SomeTable
GROUP BY SomeDate
WITH ROLLUP;
SELECT SomeDate,
COUNT(DISTINCT ContactID) AS DistinctContacts
FROM dbo.SomeTable
GROUP BY GROUPING SETS((SomeDate),());
This returns the following:
SomeDate | DistinctContacts |
---|---|
2023-01-01 | 1 |
2023-01-02 | 2 |
2023-01-03 | 1 |
2023-01-04 | 1 |
NULL | 3 |
Note that for the row where SomeDate
has the value NULL
3
is returned, not 5
.