Home > front end >  Query for count of distinct values in a rolling date range and country
Query for count of distinct values in a rolling date range and country

Time:09-16

Good evening dear Stack overflow community. This is my first question here.

I have the following issue: I need a query to count distinct values in a rolling date range (3 days) for each country.

I have done some research and found the following discussion here [1]: Query for count of distinct values in a rolling date range.

For the problem I face I need grouping not only by date, but also by country. Please consider the following input table:

Date Country Email
1/1/12 DE [email protected]
1/1/12 FRA [email protected]
1/1/12 SPA [email protected]
1/2/12 DE [email protected]
1/2/12 DE [email protected]
1/3/12 SPA [email protected]
1/3/12 SPA [email protected]
1/3/12 FRA [email protected]
1/4/12 SPA [email protected]
1/4/12 FRA [email protected]
1/4/12 FRA [email protected]
1/4/12 SPA [email protected]

The expect outcome with counted distinct emails will be the following:

Date Country Email
1/1/12 DE 1
1/1/12 FRA 1
1/1/12 SPA 1
1/2/12 DE 2
1/2/12 FRA 1
1/2/12 SPA 1
1/3/12 SPA 2
1/3/12 DE 2
1/3/12 FRA 2
1/4/12 SPA 3
1/4/12 FRA 2
1/4/12 DE 2

I was trying to modify the solution suggested in the above-mentioned discussion and modify the following by adding country in selected columns and in grouping.

SELECT date
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  date BETWEEN g.date - 2 AND g.date
      ) AS dist_emails
FROM  (SELECT generate_series(timestamp '2012-01-01'
                            , timestamp '2012-01-06'
                            , interval  '1 day')::date) AS g(date)

Unfortunately the updated query does not work, as country is not recognised and it errors.

SELECT date, country,
         ,(SELECT count(DISTINCT email)
           FROM   tbl
           WHERE  date BETWEEN g.date - 2 AND g.date
          ) AS dist_emails
    FROM  (SELECT generate_series(timestamp '2012-01-01'
                                , timestamp '2012-01-06'
                                , interval  '1 day')::date) AS g(date)
GROUP BY 1,2

Will be very grateful for your advice and sharing your expertise on the ways this can be fixed.

CodePudding user response:

The reason that country doesn't exist is that we can only SELECT columns from the tables that are in a FROM section of a query. The nested subquery selects from tbl but that is not available to the main, outer query, which only selects from the generated table g. g only has a date column, so that's the only column the outer query can select on directly.

Another issue with the query is that the COUNT is not factoring in countries.

For this I'd use basic join to query every row for every date range, then do a count distinct for every date country. You can use an INNER join to remove days that have no entries or LEFT OUTER to return rows of {date}, nil,nil if there are no entries for that date range. Something like:

SELECT g.date
     , tbl.country
     , COUNT(DISTINCT(tbl.email))
    FROM  (SELECT generate_series(timestamp '2012-01-01'
                                , timestamp '2012-01-06'
                                , interval  '1 day')::date) AS g(date)
    INNER JOIN tbl ON (tbl.date BETWEEN g.date - 2 AND g.date)

GROUP BY 1,2
  • Related