In a Postgres DB I have entries for "events", each associated with an ID and a timestamp of when it happened. I need to count them with a special rule:
when an event happens, the counter is incremented, and for the next 14 days all further events of this type are not counted.
Example:
event | created_at | blockdate | action |
---|---|---|---|
16 | 2021-11-11 11:15 | 25.11.21 | count |
16 | 2021-11-11 11:15 | 25.11.21 | block |
16 | 2021-11-13 10:45 | 25.11.21 | block |
16 | 2021-11-16 10:40 | 25.11.21 | block |
16 | 2021-11-23 11:15 | 25.11.21 | block |
16 | 2021-11-23 11:15 | 25.11.21 | block |
16 | 2021-12-10 13:00 | 24.12.21 | count |
16 | 2021-12-15 13:25 | 24.12.21 | block |
16 | 2021-12-15 13:25 | 24.12.21 | block |
16 | 2021-12-15 13:25 | 24.12.21 | block |
16 | 2021-12-20 13:15 | 24.12.21 | block |
16 | 2021-12-23 13:15 | 24.12.21 | block |
16 | 2021-12-31 13:25 | 14.01.22 | count |
16 | 2022-02-05 15:00 | 19.02.22 | count |
16 | 2022-02-05 15:00 | 19.02.22 | block |
16 | 2022-02-13 17:15 | 19.02.22 | block |
16 | 2022-02-21 10:09 | 07.03.22 | count |
43 | 2021-11-26 11:00 | 10.12.21 | count |
43 | 2022-01-01 15:00 | 15.01.22 | count |
43 | 2022-04-13 10:07 | 27.04.22 | count |
43 | 2022-04-13 10:09 | 27.04.22 | block |
43 | 2022-04-13 10:09 | 27.04.22 | block |
43 | 2022-04-13 10:09 | 27.04.22 | block |
43 | 2022-04-13 10:10 | 27.04.22 | block |
43 | 2022-04-13 10:10 | 27.04.22 | block |
43 | 2022-04-13 10:47 | 27.04.22 | block |
43 | 2022-05-11 20:25 | 25.05.22 | count |
75 | 2021-10-21 12:50 | 04.11.21 | count |
75 | 2021-11-02 12:50 | 04.11.21 | block |
75 | 2021-11-18 11:15 | 02.12.21 | count |
75 | 2021-11-18 12:55 | 02.12.21 | block |
75 | 2021-11-18 16:35 | 02.12.21 | block |
75 | 2021-11-24 11:00 | 02.12.21 | block |
75 | 2021-12-01 11:00 | 02.12.21 | block |
75 | 2021-12-14 13:25 | 28.12.21 | count |
75 | 2021-12-15 13:35 | 28.12.21 | block |
75 | 2021-12-26 13:25 | 28.12.21 | block |
75 | 2022-01-31 15:00 | 14.02.22 | count |
75 | 2022-02-02 15:30 | 14.02.22 | block |
75 | 2022-02-03 15:00 | 14.02.22 | block |
75 | 2022-02-17 15:00 | 03.03.22 | count |
75 | 2022-02-17 15:00 | 03.03.22 | block |
75 | 2022-02-18 15:00 | 03.03.22 | block |
75 | 2022-02-23 15:00 | 03.03.22 | block |
75 | 2022-02-25 15:00 | 03.03.22 | block |
75 | 2022-03-04 10:46 | 18.03.22 | count |
75 | 2022-03-08 21:05 | 18.03.22 | block |
In Excel I simply add two columns. In one column I carry over a "blockdate", the date until which events have to be blocked. In the other column I compare the row's ID with the previous ID and the previous "blockdate".
When the IDs are different or the blockdate is less than the current date, I have to count. When I have to count, I set the row's blockdate to the current date + 14 days; otherwise I carry over the previous blockdate.
I have tried to solve this in Postgres with ...
- window functions
- recursive CTEs
- lateral joins
... and all seemed a bit promising, but in the end I failed to implement this tricky count.
For example, my recursive CTE failed with "aggregate functions are not allowed in WHERE":
with recursive event_count AS (
   select event
        , min(created_at) as created
   from test
   group by event

   union all

   ( select event
          , created_at as created
     from test
     join event_count using (event)
     where created_at >= max(created) + INTERVAL '14 days'
     order by created_at
     limit 1
   )
)
select * from event_count;
Window functions using lag() to access the previous row don't seem to work either, because they cannot access columns of the previous row that were themselves computed by the window function.
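To illustrate the dead end, here is a sketch of what lag() can and cannot do: the derived block date below is computed from the raw previous row only, so it cannot carry the running block state forward across rows.

```sql
-- Sketch of the dead end: "naive_blockdate" is derived from the stored
-- previous row, not from the running decision, so chains of blocked
-- rows within one 14-day window are not handled correctly.
SELECT event, created_at,
       lag(created_at) OVER (PARTITION BY event ORDER BY created_at)
          + interval '14 days' AS naive_blockdate
FROM   test;
```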
Adding a "block-or-count" information upon entering a new event entry by simply comparing with the last entry wouldn't solve the issue as event entries "go away" after about half a year. So when the first entry goes away, the next one becomes the first and the logic has to be applied on the new situation.
The test data above can be created with:
CREATE TABLE test (
event INTEGER,
created_at TIMESTAMP
);
INSERT INTO test (event, created_at) VALUES
(16, '2021-11-11 11:15'),(16, '2021-11-11 11:15'),(16, '2021-11-13 10:45'),(16, '2021-11-16 10:40'),
(16, '2021-11-23 11:15'),(16, '2021-11-23 11:15'),(16, '2021-12-10 13:00'),(16, '2021-12-15 13:25'),
(16, '2021-12-15 13:25'),(16, '2021-12-15 13:25'),(16, '2021-12-20 13:15'),(16, '2021-12-23 13:15'),
(16, '2021-12-31 13:25'),(16, '2022-02-05 15:00'),(16, '2022-02-05 15:00'),(16, '2022-02-13 17:15'),
(16, '2022-02-21 10:09'),
(43, '2021-11-26 11:00'),(43, '2022-01-01 15:00'),(43, '2022-04-13 10:07'),(43, '2022-04-13 10:09'),
(43, '2022-04-13 10:09'),(43, '2022-04-13 10:09'),(43, '2022-04-13 10:10'),(43, '2022-04-13 10:10'),
(43, '2022-04-13 10:47'),(43, '2022-05-11 20:25'),
(75, '2021-10-21 12:50'),(75, '2021-11-02 12:50'),(75, '2021-11-18 11:15'),(75, '2021-11-18 12:55'),
(75, '2021-11-18 16:35'),(75, '2021-11-24 11:00'),(75, '2021-12-01 11:00'),(75, '2021-12-14 13:25'),
(75, '2021-12-15 13:35'),(75, '2021-12-26 13:25'),(75, '2022-01-31 15:00'),(75, '2022-02-02 15:30'),
(75, '2022-02-03 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-18 15:00'),
(75, '2022-02-23 15:00'),(75, '2022-02-25 15:00'),(75, '2022-03-04 10:46'),(75, '2022-03-08 21:05');
CodePudding user response:
I cannot think of how to do this without recursion.
with recursive ordered as (
   -- Order and number the event instances
   select event, created_at,
          row_number() over (partition by event
                             order by created_at) as n
   from test
), walk as (
   -- Get and keep first instances
   select event, created_at, n, created_at as current_base, true as keep
   from ordered
   where n = 1
   union all
   -- Carry base dates forward and mark records to keep
   select c.event, c.created_at, c.n,
          case
             when c.created_at >= p.current_base + interval '14 days'
                then c.created_at
             else p.current_base
          end as current_base,
          (c.created_at >= p.current_base + interval '14 days') as keep
   from walk p
   join ordered c
     on (c.event, c.n) = (p.event, p.n + 1)
)
select *
from walk
order by event, n;
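If you only need per-event totals rather than the full walk, the keep flag can be aggregated instead; a sketch, replacing the final SELECT above:

```sql
SELECT event, count(*) FILTER (WHERE keep) AS hits
FROM   walk
GROUP  BY event
ORDER  BY event;
```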
CodePudding user response:
This lends itself to a procedural solution, since it has to walk the whole history of existing rows for each event. But SQL can do it, too.
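For illustration, the procedural route might look like this in PL/pgSQL (a sketch: count_hits is a made-up name, and it assumes the test table from the question):

```sql
-- Walks each event's history once, in order, counting a row only when
-- it falls outside the current 14-day block window.
CREATE OR REPLACE FUNCTION count_hits()
  RETURNS TABLE (event int, hits bigint)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r           record;
   block_until timestamp;
BEGIN
   FOR r IN
      SELECT t.event, t.created_at
      FROM   test t
      ORDER  BY t.event, t.created_at
   LOOP
      IF r.event IS DISTINCT FROM event THEN   -- next event: emit previous
         IF event IS NOT NULL THEN
            RETURN NEXT;
         END IF;
         event       := r.event;
         hits        := 0;
         block_until := NULL;
      END IF;
      IF block_until IS NULL OR r.created_at >= block_until THEN
         hits        := hits + 1;                            -- count ...
         block_until := r.created_at + interval '14 days';   -- ... block 14 days
      END IF;
   END LOOP;
   IF event IS NOT NULL THEN
      RETURN NEXT;                             -- emit the last event
   END IF;
END
$func$;
```

Call it with SELECT * FROM count_hits();. The set-based solution below avoids the function entirely.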
The best solution heavily depends on cardinalities, data distribution, and other circumstances.
Assuming unfavorable conditions:
- Big table.
- Unknown number and identity of relevant events (event IDs).
- Many rows per event.
- Some overlap the 14-day time frame, some don't.
- Any number of duplicates possible.
You need an index like this one:
CREATE INDEX test_event_created_at_idx ON test (event, created_at);
Then the following query emulates an index-skip scan. If the table is vacuumed enough, it operates with index-only scans exclusively, in a single pass:
WITH RECURSIVE hit AS (
   (
   SELECT event, created_at
   FROM   test
   ORDER  BY event, created_at
   LIMIT  1
   )
   UNION ALL
   SELECT t.*
   FROM   hit h
   CROSS  JOIN LATERAL (
      SELECT t.event, t.created_at
      FROM   test t
      WHERE (t.event, t.created_at)
          > (h.event, h.created_at + interval '14 days')
      ORDER  BY t.event, t.created_at
      LIMIT  1
      ) t
   )
SELECT count(*) AS hits FROM hit;
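If you need per-event counts instead of the grand total, only the final SELECT changes; a sketch:

```sql
SELECT event, count(*) AS hits
FROM   hit
GROUP  BY event
ORDER  BY event;
```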
I cannot stress enough how fast it's going to be. :)
It's a recursive CTE with a LATERAL subquery, all based on the magic of ROW value comparison (which not all major RDBMS support properly).
Effectively, we make Postgres skip over the above index once and only take qualifying rows.
For detailed explanation, see:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Optimize GROUP BY query to retrieve latest row per user (chapter 1a)
Different approach?
Like you mention yourself, the unfortunate task definition forces you to re-compute all newer rows of an event whenever old data changes.
Consider working with a constant raster instead, like a 14-day grid starting with Jan 1 every year. Then the state of each event could be derived from its local frame. Much cheaper and more reliable.
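A minimal sketch of that idea, assuming Postgres 14 or later for date_bin() and an arbitrarily chosen anchor of Jan 1, 2021: bin every row into its 14-day frame and count one hit per occupied frame. Note that a fixed grid counts differently from the carry-over rule; that difference is the point of the suggestion.

```sql
-- Fixed 14-day raster: one hit per occupied frame per event.
-- date_bin() requires Postgres 14+; the anchor date is an assumption.
SELECT event
     , count(DISTINCT date_bin(interval '14 days', created_at,
                               timestamp '2021-01-01')) AS hits
FROM   test
GROUP  BY event
ORDER  BY event;
```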