In a Postgres DB I have entries for "events", each associated with an ID and a timestamp of when it happened. I need to count them with a special rule:
when an event happens, the counter is incremented, and for the next 14 days all further events of this type are not counted.
Example:
event | created_at | blockdate | action |
---|---|---|---|
16 | 2021-11-11 11:15 | 25.11.21 | count |
16 | 2021-11-11 11:15 | 25.11.21 | block |
16 | 2021-11-13 10:45 | 25.11.21 | block |
16 | 2021-11-16 10:40 | 25.11.21 | block |
16 | 2021-11-23 11:15 | 25.11.21 | block |
16 | 2021-11-23 11:15 | 25.11.21 | block |
16 | 2021-12-10 13:00 | 24.12.21 | count |
16 | 2021-12-15 13:25 | 24.12.21 | block |
16 | 2021-12-15 13:25 | 24.12.21 | block |
16 | 2021-12-15 13:25 | 24.12.21 | block |
16 | 2021-12-20 13:15 | 24.12.21 | block |
16 | 2021-12-23 13:15 | 24.12.21 | block |
16 | 2021-12-31 13:25 | 14.01.22 | count |
16 | 2022-02-05 15:00 | 19.02.22 | count |
16 | 2022-02-05 15:00 | 19.02.22 | block |
16 | 2022-02-13 17:15 | 19.02.22 | block |
16 | 2022-02-21 10:09 | 07.03.22 | count |
43 | 2021-11-26 11:00 | 10.12.21 | count |
43 | 2022-01-01 15:00 | 15.01.22 | count |
43 | 2022-04-13 10:07 | 27.04.22 | count |
43 | 2022-04-13 10:09 | 27.04.22 | block |
43 | 2022-04-13 10:09 | 27.04.22 | block |
43 | 2022-04-13 10:09 | 27.04.22 | block |
43 | 2022-04-13 10:10 | 27.04.22 | block |
43 | 2022-04-13 10:10 | 27.04.22 | block |
43 | 2022-04-13 10:47 | 27.04.22 | block |
43 | 2022-05-11 20:25 | 25.05.22 | count |
75 | 2021-10-21 12:50 | 04.11.21 | count |
75 | 2021-11-02 12:50 | 04.11.21 | block |
75 | 2021-11-18 11:15 | 02.12.21 | count |
75 | 2021-11-18 12:55 | 02.12.21 | block |
75 | 2021-11-18 16:35 | 02.12.21 | block |
75 | 2021-11-24 11:00 | 02.12.21 | block |
75 | 2021-12-01 11:00 | 02.12.21 | block |
75 | 2021-12-14 13:25 | 28.12.21 | count |
75 | 2021-12-15 13:35 | 28.12.21 | block |
75 | 2021-12-26 13:25 | 28.12.21 | block |
75 | 2022-01-31 15:00 | 14.02.22 | count |
75 | 2022-02-02 15:30 | 14.02.22 | block |
75 | 2022-02-03 15:00 | 14.02.22 | block |
75 | 2022-02-17 15:00 | 03.03.22 | count |
75 | 2022-02-17 15:00 | 03.03.22 | block |
75 | 2022-02-18 15:00 | 03.03.22 | block |
75 | 2022-02-23 15:00 | 03.03.22 | block |
75 | 2022-02-25 15:00 | 03.03.22 | block |
75 | 2022-03-04 10:46 | 18.03.22 | count |
75 | 2022-03-08 21:05 | 18.03.22 | block |
In Excel I simply add two columns. In one column I carry over a "blockdate", the date until which events have to be blocked. In the other column I compare the row's ID with the previous ID and the previous "blockdate".
When the IDs are different or the blockdate is less than the current date, I have to count. When I have to count, I set the row's blockdate to the current date + 14 days; otherwise I carry over the previous blockdate.
I have tried to solve this in Postgres with ...
- window functions
- recursive CTEs
- lateral joins
... and all seemed a bit promising, but in the end I failed to implement this tricky count.
For example, my recursive CTE failed with "aggregate functions are not allowed in WHERE":
with recursive event_count AS (
   select event
        , min(created_at) as created
   from test
   group by event

   union all

   ( select event
          , created_at as created
     from test
     join event_count using (event)
     where created_at >= max(created) + INTERVAL '14 days'
     order by created_at
     limit 1
   )
)
select * from event_count;
Window functions using lag() to access the previous row don't seem to work either, because they cannot access columns of the previous row that were themselves computed by the window function.
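To illustrate the dead end, here is a sketch of what lag() can and cannot do: the derived block date below is computed from the raw previous row only, so it cannot carry the running block state forward across rows.

```sql
-- Sketch of the dead end: "naive_blockdate" is derived from the stored
-- previous row, not from the running decision, so chains of blocked
-- rows within one 14-day window are not handled correctly.
SELECT event, created_at,
       lag(created_at) OVER (PARTITION BY event ORDER BY created_at)
          + interval '14 days' AS naive_blockdate
FROM   test;
```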
Adding a "block-or-count" information upon entering a new event entry by simply comparing with the last entry wouldn't solve the issue as event entries "go away" after about half a year. So when the first entry goes away, the next one becomes the first and the logic has to be applied on the new situation.
The test data above can be created with:
CREATE TABLE test (
event INTEGER,
created_at TIMESTAMP
);
INSERT INTO test (event, created_at) VALUES
(16, '2021-11-11 11:15'),(16, '2021-11-11 11:15'),(16, '2021-11-13 10:45'),(16, '2021-11-16 10:40'),
(16, '2021-11-23 11:15'),(16, '2021-11-23 11:15'),(16, '2021-12-10 13:00'),(16, '2021-12-15 13:25'),
(16, '2021-12-15 13:25'),(16, '2021-12-15 13:25'),(16, '2021-12-20 13:15'),(16, '2021-12-23 13:15'),
(16, '2021-12-31 13:25'),(16, '2022-02-05 15:00'),(16, '2022-02-05 15:00'),(16, '2022-02-13 17:15'),
(16, '2022-02-21 10:09'),
(43, '2021-11-26 11:00'),(43, '2022-01-01 15:00'),(43, '2022-04-13 10:07'),(43, '2022-04-13 10:09'),
(43, '2022-04-13 10:09'),(43, '2022-04-13 10:09'),(43, '2022-04-13 10:10'),(43, '2022-04-13 10:10'),
(43, '2022-04-13 10:47'),(43, '2022-05-11 20:25'),
(75, '2021-10-21 12:50'),(75, '2021-11-02 12:50'),(75, '2021-11-18 11:15'),(75, '2021-11-18 12:55'),
(75, '2021-11-18 16:35'),(75, '2021-11-24 11:00'),(75, '2021-12-01 11:00'),(75, '2021-12-14 13:25'),
(75, '2021-12-15 13:35'),(75, '2021-12-26 13:25'),(75, '2022-01-31 15:00'),(75, '2022-02-02 15:30'),
(75, '2022-02-03 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-18 15:00'),
(75, '2022-02-23 15:00'),(75, '2022-02-25 15:00'),(75, '2022-03-04 10:46'),(75, '2022-03-08 21:05');
CodePudding user response:
I cannot think of how to do this without recursion.
with recursive ordered as (
   -- Order and number the event instances
   select event, created_at,
          row_number() over (partition by event
                             order by created_at) as n
   from test
), walk as (
   -- Get and keep first instances
   select event, created_at, n, created_at as current_base, true as keep
   from ordered
   where n = 1
   union all
   -- Carry base dates forward and mark records to keep
   select c.event, c.created_at, c.n,
          case
             when c.created_at >= p.current_base + interval '14 days'
                then c.created_at
             else p.current_base
          end as current_base,
          (c.created_at >= p.current_base + interval '14 days') as keep
   from walk p
   join ordered c
     on (c.event, c.n) = (p.event, p.n + 1)
)
select *
from walk
order by event, n;
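If you only need per-event totals rather than the full walk, the keep flag can be aggregated instead; a sketch, replacing the final SELECT above:

```sql
SELECT event, count(*) FILTER (WHERE keep) AS hits
FROM   walk
GROUP  BY event
ORDER  BY event;
```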
CodePudding user response:
This lends itself to a procedural solution, since it has to walk the whole history of existing rows for each event. But SQL can do it, too.
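For illustration, the procedural route might look like this in PL/pgSQL (a sketch: count_hits is a made-up name, and it assumes the test table from the question):

```sql
-- Walks each event's history once, in order, counting a row only when
-- it falls outside the current 14-day block window.
CREATE OR REPLACE FUNCTION count_hits()
  RETURNS TABLE (event int, hits bigint)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r           record;
   block_until timestamp;
BEGIN
   FOR r IN
      SELECT t.event, t.created_at
      FROM   test t
      ORDER  BY t.event, t.created_at
   LOOP
      IF r.event IS DISTINCT FROM event THEN   -- next event: emit previous
         IF event IS NOT NULL THEN
            RETURN NEXT;
         END IF;
         event       := r.event;
         hits        := 0;
         block_until := NULL;
      END IF;
      IF block_until IS NULL OR r.created_at >= block_until THEN
         hits        := hits + 1;                            -- count ...
         block_until := r.created_at + interval '14 days';   -- ... block 14 days
      END IF;
   END LOOP;
   IF event IS NOT NULL THEN
      RETURN NEXT;                             -- emit the last event
   END IF;
END
$func$;
```

Call it with SELECT * FROM count_hits();. The set-based solution below avoids the function entirely.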
The best solution heavily depends on cardinalities, data distribution, and other circumstances.
Assuming unfavorable conditions:
- Big table.
- Unknown number and identity of relevant events (event IDs).
- Many rows per event.
- Some overlap the 14-day time frame, some don't.
- Any number of duplicates possible.
You need an index like this one:
CREATE INDEX test_event_created_at_idx ON test (event, created_at);
Then the following query emulates an index-skip scan. If the table is vacuumed enough, it operates with index-only scans exclusively, in a single pass:
WITH RECURSIVE hit AS (
   (
   SELECT event, created_at
   FROM   test
   ORDER  BY event, created_at
   LIMIT  1
   )
   UNION ALL
   SELECT t.*
   FROM   hit h
   CROSS  JOIN LATERAL (
      SELECT t.event, t.created_at
      FROM   test t
      WHERE (t.event, t.created_at)
          > (h.event, h.created_at + interval '14 days')
      ORDER  BY t.event, t.created_at
      LIMIT  1
      ) t
   )
SELECT count(*) AS hits FROM hit;
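If you need per-event counts instead of the grand total, only the final SELECT changes; a sketch:

```sql
SELECT event, count(*) AS hits
FROM   hit
GROUP  BY event
ORDER  BY event;
```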
I cannot stress enough how fast it's going to be. :)
It's a recursive CTE with a LATERAL subquery, all based on the magic of ROW value comparison (which not all major RDBMS support properly).
Effectively, we make Postgres skip over the above index once and only take qualifying rows.
For detailed explanation, see:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Optimize GROUP BY query to retrieve latest row per user (chapter 1a)
Different approach?
Like you mention yourself, the unfortunate task definition forces you to re-compute all newer rows of an event whenever old data changes.
Consider working with a constant raster instead, like a 14-day grid starting with Jan 1 every year. Then the state of each event could be derived from its local frame. Much cheaper and more reliable.
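A minimal sketch of that idea, assuming Postgres 14 or later for date_bin() and an arbitrarily chosen anchor of Jan 1, 2021: bin every row into its 14-day frame and count one hit per occupied frame. Note that a fixed grid counts differently from the carry-over rule; that difference is the point of the suggestion.

```sql
-- Fixed 14-day raster: one hit per occupied frame per event.
-- date_bin() requires Postgres 14+; the anchor date is an assumption.
SELECT event
     , count(DISTINCT date_bin(interval '14 days', created_at,
                               timestamp '2021-01-01')) AS hits
FROM   test
GROUP  BY event
ORDER  BY event;
```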