How to filter last rows based on some criteria from other columns in PostgreSQL

Time:01-05

Suppose I have a table (DeliveryEvent) like this grouped by delivery_id, where event_type 2 is an event of 'scheduled', 3 is an event of 'unscheduled', and 4 is an event of 'completed':

id  created           event_type  delivery_id  extra
1   2022-10-27 18:04  2           10005
2   2022-10-27 19:00  3           10005        {"couldn't deliver"}
3   2022-10-27 19:20  2           10005
4   2022-10-27 20:30  3           10005        {"timeout"}
5   2022-10-27 21:15  2           10005
6   2022-10-27 22:40  3           10005        {"timeout"}
7   2022-10-27 22:55  2           10005
8   2022-10-27 23:00  4           10005

...

For each unscheduled event caused by a 'timeout', I need a result row that also contains the scheduled event that occurred just before it, so that I can compute the duration between the scheduled and the unscheduled event.

The SELECT below returns every combination of a scheduled event with each later unscheduled event caused by a timeout:

SELECT
    scheduled.id as scheduled_id,
    scheduled.created as scheduled_time,
    scheduled.event_type as scheduled_event,
    scheduled.delivery_id as delivery_id,
    unscheduled.id as unscheduled_id,
    unscheduled.created as unscheduled_time,
    unscheduled.event_type as unscheduled_event,
    unscheduled.extra as extra
FROM
    delivery_event scheduled
JOIN
    delivery_event unscheduled ON scheduled.delivery_id = 10005
    AND unscheduled.delivery_id = 10005
    AND unscheduled.event_type = 3
    AND scheduled.event_type = 2
    AND scheduled.created < unscheduled.created
    -- assumes extra is json/jsonb with a "timeout" key
    AND (unscheduled.extra->>'timeout') IS NOT NULL

scheduled_id  scheduled_time    scheduled_event  delivery_id  unscheduled_id  unscheduled_time  unscheduled_event  extra
5             2022-10-27 21:15  2                10005        6               2022-10-27 22:40  3                  {"timeout"}
3             2022-10-27 19:20  2                10005        6               2022-10-27 22:40  3                  {"timeout"}
1             2022-10-27 18:04  2                10005        6               2022-10-27 22:40  3                  {"timeout"}
3             2022-10-27 19:20  2                10005        4               2022-10-27 20:30  3                  {"timeout"}
1             2022-10-27 18:04  2                10005        4               2022-10-27 20:30  3                  {"timeout"}

Instead, I'd like to get only the following result:

scheduled_id  scheduled_time    scheduled_event  delivery_id  unscheduled_id  unscheduled_time  unscheduled_event  extra
5             2022-10-27 21:15  2                10005        6               2022-10-27 22:40  3                  {"timeout"}
3             2022-10-27 19:20  2                10005        4               2022-10-27 20:30  3                  {"timeout"}

That is, only the last scheduled event before each unscheduled event caused by a timeout.

CodePudding user response:

Try this:

SELECT a.*
FROM (
    -- lag() reads the row immediately before the current one within the delivery
    SELECT delivery_id
         , lag(id, 1) OVER w AS scheduled_id
         , lag(created, 1) OVER w AS scheduled_time
         , lag(event_type, 1) OVER w AS scheduled_event
         , id AS unscheduled_id
         , created AS unscheduled_time
         , event_type AS unscheduled_event
         , extra
      FROM delivery_event
     WHERE delivery_id = 10005
    WINDOW w AS (PARTITION BY delivery_id ORDER BY created)
) AS a
WHERE a.scheduled_event = 2
  AND a.unscheduled_event = 3
  -- assumes extra is json/jsonb with a "timeout" key
  AND (a.extra->>'timeout') IS NOT NULL
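
Note that lag() only looks at the row immediately before each event, so the query above pairs a timeout with a scheduled event only when nothing else happened in between. As a rough sketch of a variant that tolerates intervening events, a running max(...) FILTER window can carry the latest scheduled timestamp forward and give the duration directly. The inline sample rows and the jsonb shape of extra ({"timeout": true}) are assumptions made for illustration, not something the question confirms.

```sql
-- Sample rows from the question, inlined so the sketch runs on its own.
-- The CTE shadows any real delivery_event table; the jsonb shape is assumed.
WITH delivery_event (id, created, event_type, delivery_id, extra) AS (
    VALUES
        (1, timestamp '2022-10-27 18:04', 2, 10005, NULL::jsonb),
        (2, '2022-10-27 19:00', 3, 10005, '{"couldn''t deliver": true}'),
        (3, '2022-10-27 19:20', 2, 10005, NULL),
        (4, '2022-10-27 20:30', 3, 10005, '{"timeout": true}'),
        (5, '2022-10-27 21:15', 2, 10005, NULL),
        (6, '2022-10-27 22:40', 3, 10005, '{"timeout": true}'),
        (7, '2022-10-27 22:55', 2, 10005, NULL),
        (8, '2022-10-27 23:00', 4, 10005, NULL)
),
with_last_scheduled AS (
    SELECT
        id,
        created,
        event_type,
        delivery_id,
        extra,
        -- Latest 'scheduled' (event_type = 2) timestamp seen so far in this delivery.
        max(created) FILTER (WHERE event_type = 2)
            OVER (PARTITION BY delivery_id ORDER BY created) AS scheduled_time
    FROM delivery_event
)
SELECT
    delivery_id,
    scheduled_time,
    id AS unscheduled_id,
    created AS unscheduled_time,
    created - scheduled_time AS duration
FROM with_last_scheduled
WHERE event_type = 3
  AND (extra->>'timeout') IS NOT NULL
ORDER BY unscheduled_time;
```

With the sample rows this should return two rows, for unscheduled ids 4 and 6, with durations of 01:10:00 and 01:25:00. Unlike the lag() version it returns only the scheduled timestamp, not the scheduled id.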

CodePudding user response:

Two solutions come to mind. The easiest is probably a DISTINCT ON clause, which keeps a single row per unscheduled_id. The ORDER BY must start with the DISTINCT ON expression; sorting by scheduled.created DESC after it ensures that the row kept is the latest scheduled event.

SELECT DISTINCT ON (unscheduled.id)
    scheduled.id as scheduled_id,
    scheduled.created as scheduled_time,
    scheduled.event_type as scheduled_event,
    scheduled.delivery_id as delivery_id,
    unscheduled.id as unscheduled_id,
    unscheduled.created as unscheduled_time,
    unscheduled.event_type as unscheduled_event,
    unscheduled.extra as extra
FROM
    delivery_event scheduled
JOIN
    delivery_event unscheduled ON scheduled.delivery_id = 10005
    AND unscheduled.delivery_id = 10005
    AND unscheduled.event_type = 3
    AND scheduled.event_type = 2
    AND scheduled.created < unscheduled.created
    -- assumes extra is json/jsonb with a "timeout" key
    AND (unscheduled.extra->>'timeout') IS NOT NULL
ORDER BY unscheduled.id, scheduled.created DESC

An alternative solution would be to use a left lateral join such that only the latest scheduled event will be joined.

SELECT
    scheduled.id as scheduled_id,
    scheduled.created as scheduled_time,
    scheduled.event_type as scheduled_event,
    scheduled.delivery_id as delivery_id,
    unscheduled.id as unscheduled_id,
    unscheduled.created as unscheduled_time,
    unscheduled.event_type as unscheduled_event,
    unscheduled.extra as extra
FROM
    delivery_event unscheduled
LEFT JOIN LATERAL (
    -- latest scheduled event before the current unscheduled one
    SELECT scheduled.id,
           scheduled.created,
           scheduled.event_type,
           scheduled.delivery_id
    FROM delivery_event scheduled
    WHERE scheduled.delivery_id = 10005
      AND scheduled.event_type = 2
      AND scheduled.created < unscheduled.created
    ORDER BY scheduled.created DESC
    LIMIT 1
) scheduled ON TRUE
WHERE unscheduled.delivery_id = 10005
    AND unscheduled.event_type = 3
    -- assumes extra is json/jsonb with a "timeout" key
    AND (unscheduled.extra->>'timeout') IS NOT NULL
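
Since the goal is the duration per timeout, the same lateral lookup can also be sketched without the hard-coded delivery_id, by correlating on delivery_id inside the subquery. This assumes the column names from the table in the question (event_type, extra) and that extra is json/jsonb with a "timeout" key. CROSS JOIN LATERAL is used here instead of LEFT JOIN LATERAL, so timeouts without any earlier scheduled event are dropped rather than returned with NULLs.

```sql
SELECT
    unscheduled.delivery_id,
    scheduled.id AS scheduled_id,
    scheduled.created AS scheduled_time,
    unscheduled.id AS unscheduled_id,
    unscheduled.created AS unscheduled_time,
    -- timestamp - timestamp yields an interval
    unscheduled.created - scheduled.created AS duration
FROM delivery_event AS unscheduled
CROSS JOIN LATERAL (
    -- latest scheduled event of the same delivery before this timeout
    SELECT s.id, s.created
    FROM delivery_event AS s
    WHERE s.delivery_id = unscheduled.delivery_id
      AND s.event_type = 2
      AND s.created < unscheduled.created
    ORDER BY s.created DESC
    LIMIT 1
) AS scheduled
WHERE unscheduled.event_type = 3
  AND (unscheduled.extra->>'timeout') IS NOT NULL;
```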

One word of warning about joining tables like this: although joining to only the one relevant row is very convenient, it tends to come with significant performance drawbacks, so check whether the performance of either method is satisfactory for your particular circumstances.
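
One way to check this, sketched under assumptions: an index covering the lookup columns usually lets the per-row subquery be answered with a short index scan, and EXPLAIN shows what the planner actually does. The index name is hypothetical, and the column names follow the table in the question.

```sql
-- Hypothetical index name; the column order mirrors the subquery's
-- equality filters (delivery_id, event_type) followed by the sort column.
CREATE INDEX IF NOT EXISTS delivery_event_lookup_idx
    ON delivery_event (delivery_id, event_type, created DESC);

-- ANALYZE executes the query for real and reports actual timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.id AS unscheduled_id, s.id AS scheduled_id
FROM delivery_event AS u
LEFT JOIN LATERAL (
    SELECT d.id
    FROM delivery_event AS d
    WHERE d.delivery_id = u.delivery_id
      AND d.event_type = 2
      AND d.created < u.created
    ORDER BY d.created DESC
    LIMIT 1
) AS s ON TRUE
WHERE u.event_type = 3;
```

Comparing the plans and timings of the DISTINCT ON and lateral variants on realistic data volumes is probably the most reliable way to choose between them.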
