Looking for a Cumulative Total for Unique Customers Only Month by Month-CodePudding

I am trying to create a cumulative count for unique customers only by the month they purchased. The example Table is:

| customer_email | cohortMonth |
| -------- | -------------- |
| [email protected]| 10/2019      |
| [email protected]| 10/2019      |
| [email protected]| 10/2019      |
| [email protected]| 11/2019      |
| [email protected]| 11/2019      |
| [email protected]| 12/2019      |

The output I am looking for is the total Customers for 10/2019 would be 3, The cumulative total customers for 11/2019 would be 4 taking all of the customers purchased in 10/2019 and adding [email protected] as this is the only Unique customer email for the month. The cumulative total customers for 12/2019 will still be 4 as no new customers purchased in this month.

| cohortMonth | cumulative_total_customers|
| -------- | -------------- |
| 10/2019| 3|
| 11/2019| 4|
| 12/2019| 4|

CodePudding user response：

try:

with dataset as (
    SELECT 
        customer,
        date_trunc('month', date(purchase_date)) purchase_date 
    FROM (
            VALUES 
                ('[email protected]', '2019-10-05'),
                ('[email protected]', '2019-10-10'),
                ('[email protected]', '2019-10-15'),
                ('[email protected]', '2019-11-03'),
                ('[email protected]', '2019-11-28'),
                ('[email protected]', '2019-12-12')
        ) AS x (customer, purchase_date)
)


SELECT purchase_date,
       COUNT( customer ) AS daily_cumulative_count
FROM ( 
    SELECT 
        a.purchase_date AS purchase_date,
        b.customer
    FROM dataset AS a
    INNER JOIN dataset AS b ON a.purchase_date >= b.purchase_date
    GROUP BY 
        a.purchase_date,
        b.customer
     ) 
GROUP BY 
    purchase_date
;

CodePudding user response：

The ideal solution would be to use a window function with a DISTINCT clause so that to eliminate the dupplicated emails :

SELECT DISTINCT ON (cohortMonth)
       cohortMonth
     , count(DISTINCT customer_email) OVER (ORDER BY cohortMonth RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_total_customers
  FROM mytable

But we get the postgres error :

DISTINCT is not implemented for window functions

So in order to eliminate the dupplicates, we have two solutions :

Solution 1 : build the array of customer_email by cohortMonth and then eliminate the dupplicates with a query

SELECT l.cohortMonth, array_length(array_agg(DISTINCT a), 1) AS cumulative_total_customers
FROM
   ( SELECT DISTINCT ON (cohortMonth)
            cohortMonth
          , array_agg(customer_email) OVER (ORDER BY cohortMonth RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS email_array
       FROM mytable
   ) AS l
CROSS JOIN LATERAL unnest(l.email_array) AS a
GROUP BY l.cohortMonth

Solution 2 : build the array of customer_email by cohortMonth and then eliminate the dupplicates with a specific function count_distinct_array_element

CREATE OR REPLACE FUNCTION count_distinct_array_element(x anyarray) RETURNS bigint LANGUAGE sql AS $$
SELECT count(DISTINCT e) FROM unnest(x) AS e ; $$ ;

SELECT DISTINCT ON (cohortMonth)
       cohortMonth
     , count_distinct_array_element(array_agg(customer_email) OVER (ORDER BY cohortMonth RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS cumulative_total_customers
  FROM mytable

Result :

cohortmonth	cumulative_total_customers
10/2019	3
11/2019	4
12/2019	4

see test results in dbfiddle