I am trying to find out the count of sequential occurrences and their lengths for a certain user action. Let's say it is monthly logins. Below is a simplified structure of my table and my expected end result.
PostgreSQL Schema Setup:
CREATE TABLE user_actions (
action_date VARCHAR(255),
user_id VARCHAR(255)
);
INSERT INTO user_actions(action_date, user_id)
VALUES('2020-03', 'alex01'),
('2020-04', 'alex01'),
('2020-05', 'alex01'),
('2020-06', 'alex01'),
('2020-12', 'alex01'),
('2021-01', 'alex01'),
('2021-02', 'alex01'),
('2021-03', 'alex01'),
('2020-04', 'jon03'),
('2020-05', 'jon03'),
('2020-06', 'jon03'),
('2020-09', 'jon03'),
('2021-11', 'jon03'),
('2021-12', 'jon03'),
('2022-01', 'jon03'),
('2022-02', 'jon03'),
('2020-05', 'mark05'),
('2020-06', 'mark05'),
('2020-07', 'mark05'),
('2020-08', 'mark05'),
('2020-09', 'mark05')
- alex01 have 2 streak with length of 4
- jon03 have 3 streaks with length of 1, 3 and 4
- mark05 have 1 streak with length of 5 And we would like to get the occurrence counts of longest streaks per user
Streak Length | # of occurrences |
---|---|
5 | 1 |
4 | 2 |
I have tried to research if this type of a problem was asked before and found this question . Although the answer was quite helpful but not exactly what I need
CodePudding user response:
You can follow these steps to go from your varchar dates to your final output:
- format your dates with
DATE
datatype, to allow for dates comparison - compute when each date value is not consecutive for each "user_id", flag it with 1 and that will represent the beginning of a new partition
- compute a running sum on your flag, to generate your partition values
- retrieve the biggest count of dates within each partition
- count how many counts you have found in the previous step
Each of these steps is made within a subquery:
WITH cte AS (
SELECT user_id,
TO_DATE(action_date || '-01','YYYY-MM-DD') AS date_as_date
FROM user_actions
), islands AS (
SELECT *,
CASE WHEN LAG(date_as_date) OVER(PARTITION BY user_id ORDER BY date_as_date) INTERVAL '1 month' <> date_as_date
THEN 1 ELSE 0
END AS new_partition
FROM cte
), partitioned_islands AS (
SELECT user_id,
SUM(new_partition) OVER(PARTITION BY user_id ORDER BY date_as_date) AS partitions
FROM islands
), count_per_island AS (
SELECT COUNT(*) AS num_consecutive_dates
FROM partitioned_islands
GROUP BY user_id, partitions
ORDER BY ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY COUNT(*) DESC)
FETCH FIRST 1 ROWS WITH TIES
)
SELECT num_consecutive_dates AS streakLength,
COUNT(*) AS numOfOccurrences
FROM count_per_island
GROUP BY streakLength
ORDER BY streakLength DESC
Check the demo here.