Redshift cumulative count records via SQL

I've been struggling to find an answer to this question. I think this question is similar to what I'm looking for, but when I tried it, it didn't work. I want a cumulative count of distinct user_id values by date. Because no new unique user_id is added between 02-20 and 02-27, the cumulative count stays the same on those dates. Then on 02-27 there is a unique user_id that hasn't appeared on any previous date (6), so the count increases.

Here's my input:

date                user_id
2020-02-20          1
2020-02-20          2
2020-02-20          3
2020-02-20          4
2020-02-20          4
2020-02-20          5
2020-02-21          1
2020-02-22          2
2020-02-23          3
2020-02-24          4
2020-02-25          4
2020-02-27          6

Output table:

date            daily_cumulative_count
2020-02-20              5
2020-02-21              5
2020-02-22              5
2020-02-23              5
2020-02-24              5
2020-02-25              5
2020-02-27              6

This is what I tried, but the result is not quite what I want:

select
    stat_date,
    count(DISTINCT user_id),
    sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date

It returns this instead:

date,count,cumulative_sum
2022-02-20,5,5
2022-02-21,1,6
2022-02-22,1,7
2022-02-23,1,8
2022-02-24,1,9
2022-02-25,1,10
2022-02-27,1,11

CodePudding user response：

We can do this via a correlated subquery followed by aggregation:

WITH cte AS (
    SELECT
        date,
        CASE WHEN EXISTS (
            SELECT 1
            FROM data_engineer_interview d2
            WHERE d2.date < d1.date AND
                  d2.user_id = d1.user_id
        ) THEN 0 ELSE 1 END AS flag
    FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)

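-- Sum the first-appearance flags per date, then take a running total over dates
SELECT date, SUM(SUM(flag)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS daily_cumulative_count
FROM cte
GROUP BY date
ORDER BY date;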

CodePudding user response：

The problem with this task is that the obvious approach compares each row with all previous rows to see whether there is a match in user_id. Since you are using Redshift, I'll assume that your table could be very large, so attacking the problem this way will bog down in some form of nested loop join.

You want to think about the problem differently to avoid this looping issue. If you derive a dataset with each user_id and the first date on which it appears, you can then just do a cumulative sum sorted by that date. Like this:

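select first_date,
  -- running total of users whose first appearance is on or before first_date
  sum(count(*)) over (order by first_date rows unbounded preceding) as cumulative_count
from (
  -- one row per user: the first date on which that user appears
  select user_id, min("date") as first_date
  from data_engineer_interview
  group by user_id
) first_seen
group by first_date
order by first_date;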

This is untested and won't produce the full list of dates in your example output, only the dates where new ids show up. If this is an issue, it is simple to add in the additional dates with no count change.
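Adding the missing dates could look something like the sketch below. It is likewise untested on Redshift and assumes the table and column names from the question; the CTE names (first_seen, new_per_day, all_dates) and the activity_date alias are made up for illustration. The idea is to left join every date in the table to the per-date count of first appearances, treat dates with no new users as 0, and then take the running total.

```sql
WITH first_seen AS (
    -- one row per user: the first date on which that user appears
    SELECT user_id, MIN("date") AS first_date
    FROM data_engineer_interview
    GROUP BY user_id
),
new_per_day AS (
    -- number of users whose first appearance falls on each date
    SELECT first_date, COUNT(*) AS new_users
    FROM first_seen
    GROUP BY first_date
),
all_dates AS (
    -- every date present in the table, including dates with no new users
    SELECT DISTINCT "date" AS activity_date
    FROM data_engineer_interview
)
SELECT d.activity_date AS "date",
       SUM(COALESCE(n.new_users, 0)) OVER (
           ORDER BY d.activity_date
           ROWS UNBOUNDED PRECEDING
       ) AS daily_cumulative_count
FROM all_dates d
LEFT JOIN new_per_day n
       ON n.first_date = d.activity_date
ORDER BY d.activity_date;
```

On the sample input, first_seen maps users 1 to 5 to 2020-02-20 and user 6 to 2020-02-27, so the running total is 5 for every date from 02-20 through 02-25 and 6 on 02-27, which matches the requested output. Dates that do not occur in the table at all (such as 02-26) are not generated; that would need a separate calendar of dates.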