Home > Enterprise >  Count of Sequential occurrences and their lengths (advanced gaps and islands problem)
Count of Sequential occurrences and their lengths (advanced gaps and islands problem)

Time:12-20

I am trying to find out the count of sequential occurrences and their lengths for a certain user action. Let's say it is monthly logins. Below is a simplified structure of my table and my expected end result.

PostgreSQL Schema Setup:

  CREATE TABLE user_actions (
        action_date VARCHAR(255),
        user_id VARCHAR(255)
    );
INSERT INTO user_actions(action_date, user_id)
VALUES('2020-03', 'alex01'),
('2020-04', 'alex01'),
('2020-05', 'alex01'),
('2020-06', 'alex01'),
('2020-12', 'alex01'),
('2021-01', 'alex01'),
('2021-02', 'alex01'),
('2021-03', 'alex01'),
('2020-04', 'jon03'),
('2020-05', 'jon03'),
('2020-06', 'jon03'),
('2020-09', 'jon03'),
('2021-11', 'jon03'),
('2021-12', 'jon03'),
('2022-01', 'jon03'),
('2022-02', 'jon03'),
('2020-05', 'mark05'),
('2020-06', 'mark05'),
('2020-07', 'mark05'),
('2020-08', 'mark05'),
('2020-09', 'mark05')
  • alex01 have 2 streak with length of 4
  • jon03 have 3 streaks with length of 1, 3 and 4
  • mark05 have 1 streak with length of 5 And we would like to get the occurrence counts of longest streaks per user
Streak Length # of occurrences
5 1
4 2

I have tried to research if this type of a problem was asked before and found this question . Although the answer was quite helpful but not exactly what I need

CodePudding user response:

You can follow these steps to go from your varchar dates to your final output:

  • format your dates with DATE datatype, to allow for dates comparison
  • compute when each date value is not consecutive for each "user_id", flag it with 1 and that will represent the beginning of a new partition
  • compute a running sum on your flag, to generate your partition values
  • retrieve the biggest count of dates within each partition
  • count how many counts you have found in the previous step

Each of these steps is made within a subquery:

WITH cte AS (
    SELECT user_id, 
           TO_DATE(action_date || '-01','YYYY-MM-DD') AS date_as_date
    FROM user_actions
), islands AS ( 
    SELECT *, 
           CASE WHEN LAG(date_as_date) OVER(PARTITION BY user_id ORDER BY date_as_date)   INTERVAL '1 month' <> date_as_date
                THEN 1 ELSE 0 
           END AS new_partition
    FROM cte
), partitioned_islands AS (
    SELECT user_id, 
           SUM(new_partition) OVER(PARTITION BY user_id ORDER BY date_as_date) AS partitions
    FROM islands
), count_per_island AS (
    SELECT COUNT(*) AS num_consecutive_dates
    FROM partitioned_islands
    GROUP BY user_id, partitions
    ORDER BY ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY COUNT(*) DESC)
    FETCH FIRST 1 ROWS WITH TIES
)
SELECT num_consecutive_dates AS streakLength, 
       COUNT(*) AS numOfOccurrences  
FROM count_per_island
GROUP BY streakLength
ORDER BY streakLength DESC

Check the demo here.

  • Related