I am trying to create a cumulative count for unique customers only by the month they purchased. The example Table is:
| customer_email | cohortMonth |
| -------- | -------------- |
| [email protected]| 10/2019 |
| [email protected]| 10/2019 |
| [email protected]| 10/2019 |
| [email protected]| 11/2019 |
| [email protected]| 11/2019 |
| [email protected]| 12/2019 |
The output I am looking for is the total Customers for 10/2019 would be 3, The cumulative total customers for 11/2019 would be 4 taking all of the customers purchased in 10/2019 and adding [email protected] as this is the only Unique customer email for the month. The cumulative total customers for 12/2019 will still be 4 as no new customers purchased in this month.
| cohortMonth | cumulative_total_customers|
| -------- | -------------- |
| 10/2019| 3|
| 11/2019| 4|
| 12/2019| 4|
CodePudding user response:
try:
with dataset as (
SELECT
customer,
date_trunc('month', date(purchase_date)) purchase_date
FROM (
VALUES
('[email protected]', '2019-10-05'),
('[email protected]', '2019-10-10'),
('[email protected]', '2019-10-15'),
('[email protected]', '2019-11-03'),
('[email protected]', '2019-11-28'),
('[email protected]', '2019-12-12')
) AS x (customer, purchase_date)
)
SELECT purchase_date,
COUNT( customer ) AS daily_cumulative_count
FROM (
SELECT
a.purchase_date AS purchase_date,
b.customer
FROM dataset AS a
INNER JOIN dataset AS b ON a.purchase_date >= b.purchase_date
GROUP BY
a.purchase_date,
b.customer
)
GROUP BY
purchase_date
;
CodePudding user response:
The ideal solution would be to use a window
function with a DISTINCT
clause so that to eliminate the dupplicated emails :
SELECT DISTINCT ON (cohortMonth)
cohortMonth
, count(DISTINCT customer_email) OVER (ORDER BY cohortMonth RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_total_customers
FROM mytable
But we get the postgres error :
DISTINCT is not implemented for window functions
So in order to eliminate the dupplicates, we have two solutions :
Solution 1 : build the array of customer_email by cohortMonth and then eliminate the dupplicates with a query
SELECT l.cohortMonth, array_length(array_agg(DISTINCT a), 1) AS cumulative_total_customers
FROM
( SELECT DISTINCT ON (cohortMonth)
cohortMonth
, array_agg(customer_email) OVER (ORDER BY cohortMonth RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS email_array
FROM mytable
) AS l
CROSS JOIN LATERAL unnest(l.email_array) AS a
GROUP BY l.cohortMonth
Solution 2 : build the array of customer_email by cohortMonth and then eliminate the dupplicates with a specific function count_distinct_array_element
CREATE OR REPLACE FUNCTION count_distinct_array_element(x anyarray) RETURNS bigint LANGUAGE sql AS $$
SELECT count(DISTINCT e) FROM unnest(x) AS e ; $$ ;
SELECT DISTINCT ON (cohortMonth)
cohortMonth
, count_distinct_array_element(array_agg(customer_email) OVER (ORDER BY cohortMonth RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS cumulative_total_customers
FROM mytable
Result :
cohortmonth | cumulative_total_customers |
---|---|
10/2019 | 3 |
11/2019 | 4 |
12/2019 | 4 |
see test results in dbfiddle