Good evening dear Stack overflow community. This is my first question here.
I have the following issue: I need a query to count distinct values in a rolling date range (3 days) for each country.
I have done some research and found the following discussion here [1]: Query for count of distinct values in a rolling date range.
For the problem I face I need grouping not only by date, but also by country. Please consider the following input table:
Date | Country | |
---|---|---|
1/1/12 | DE | [email protected] |
1/1/12 | FRA | [email protected] |
1/1/12 | SPA | [email protected] |
1/2/12 | DE | [email protected] |
1/2/12 | DE | [email protected] |
1/3/12 | SPA | [email protected] |
1/3/12 | SPA | [email protected] |
1/3/12 | FRA | [email protected] |
1/4/12 | SPA | [email protected] |
1/4/12 | FRA | [email protected] |
1/4/12 | FRA | [email protected] |
1/4/12 | SPA | [email protected] |
The expect outcome with counted distinct emails will be the following:
Date | Country | |
---|---|---|
1/1/12 | DE | 1 |
1/1/12 | FRA | 1 |
1/1/12 | SPA | 1 |
1/2/12 | DE | 2 |
1/2/12 | FRA | 1 |
1/2/12 | SPA | 1 |
1/3/12 | SPA | 2 |
1/3/12 | DE | 2 |
1/3/12 | FRA | 2 |
1/4/12 | SPA | 3 |
1/4/12 | FRA | 2 |
1/4/12 | DE | 2 |
I was trying to modify the solution suggested in the above-mentioned discussion and modify the following by adding country in selected columns and in grouping.
SELECT date
,(SELECT count(DISTINCT email)
FROM tbl
WHERE date BETWEEN g.date - 2 AND g.date
) AS dist_emails
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date)
Unfortunately the updated query does not work, as country is not recognised and it errors.
SELECT date, country,
,(SELECT count(DISTINCT email)
FROM tbl
WHERE date BETWEEN g.date - 2 AND g.date
) AS dist_emails
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date)
GROUP BY 1,2
Will be very grateful for your advice and sharing your expertise on the ways this can be fixed.
CodePudding user response:
The reason that country
doesn't exist is that we can only SELECT
columns from the tables that are in a FROM
section of a query. The nested subquery selects from tbl
but that is not available to the main, outer query, which only selects from the generated table g
. g
only has a date
column, so that's the only column the outer query can select on directly.
Another issue with the query is that the COUNT
is not factoring in countries.
For this I'd use basic join to query every row for every date range, then do a count distinct
for every date country. You can use an INNER
join to remove days that have no entries or LEFT OUTER
to return rows of {date}, nil,nil
if there are no entries for that date range. Something like:
SELECT g.date
, tbl.country
, COUNT(DISTINCT(tbl.email))
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date)
INNER JOIN tbl ON (tbl.date BETWEEN g.date - 2 AND g.date)
GROUP BY 1,2