How to merge intervals based on a minimum gap of dates?-CodePudding

My current table is like the below, each patient has their visit start date and end date to a hospital, and they are administered a drug between admin_startdate and admin_enddate. For example, the first two rows mean, patient PT1 has two drug administrations, one between 01/08 and 01/10 & the other between 01/12 and 01/23, during her visit from 01/01 to 01/31.

ptid    visit_start_date    visit_end_date  admin_startdate    admin_enddate
PT1       2018-01-01          2018-01-31        2018-01-08        2018-01-10
PT1       2018-01-01          2018-01-31        2018-01-12        2018-01-23
PT2       2018-01-02          2018-01-18        2018-01-06        2018-01-11
PT2       2018-01-02          2018-01-18        2018-01-14        2018-01-17

What I would like to achieve is to lump together the drug administration that are too close together, say, the end date of the previous one is <= 2 days of the start date of new one, and call that a whole episode, like below:

ptid    visit_start_date    visit_end_date  admin_startdate    admin_enddate  episode_startdate episode_enddate
PT1       2018-01-01          2018-01-31    2018-01-08        2018-01-10          2018-01-08         2018-01-23
PT1       2018-01-01          2018-01-31    2018-01-12        2018-01-23          2018-01-08         2018-01-23
PT2       2018-01-02          2018-01-18    2018-01-06        2018-01-11          2018-01-06         2018-01-11
PT2       2018-01-02          2018-01-18    2018-01-14        2018-01-17          2018-01-14         2018-01-17

You can see that PT1's two administrations are lumped together with the same episode_startdate and episode_enddate, whereas PT2's two administrations are considered two separate episode.

I have a hard time to figure out how to do it in PostgreSQL (Redshift).

CodePudding user response：

This work in Postgres 14. Not tested for Redshift.

SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
     , min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
     , max(admin_enddate)   OVER (PARTITION BY visit_id, admin) AS episode_enddate
FROM (
   SELECT *, count(*) FILTER (WHERE gap) OVER (PARTITION BY visit_id ORDER BY admin_startdate) AS admin
   FROM (
      SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
      FROM  (
         SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id  -- optional, to simplify
         FROM   tbl
         ) sub1
      ) sub2
   ) sub3

db<>fiddle here

The innermost subquery sub1 is only to compute a unique visit_id - which should really be in your table instead of repeating (ptid, visit_start_date, visit_end_date ) over and over. Consider normalizing your design at least that much.

The next subquery sub2 checks for a gap that's greater than two days to the previous row in the same partition.

Subquery sub3 then counts those gaps to identify distinct administration periods (admin)

In the outer SELECT, min(admin_startdate) and max(admin_enddate) per administration period produce the desired episode dates.

See (with assorted links to more):

How to group timestamps into islands (based on arbitrary gap)?

CodePudding user response：

    CREATE TABLE tb1 AS (
SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
      FROM  (
         SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id  -- optional, to simplify
         FROM  tbl
) sub1
) ;

CREATE TABLE tb2 AS (
   SELECT *, count(*)  OVER (PARTITION BY visit_id ORDER BY admin_startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS admin
   FROM tb1
   WHERE gap is True
)
;

CREATE TABLE tb3 AS (
   SELECT tb1.ptid, tb1.visit_start_date, tb1.visit_end_date, tb1.admin_startdate, tb1.admin_enddate, tb1.visit_id, tb1.gap, 
   CASE WHEN tb2.admin is NULL THEN 0 else tb2.admin END AS admin
   FROM tb1 
   LEFT JOIN tb2
   ON tb1.ptid = tb2.ptid AND tb1.visit_start_date = tb2.visit_start_date AND tb1.visit_end_date = tb2.visit_end_date AND tb1.admin_startdate = tb2.admin_startdate AND tb1.admin_enddate = tb2.admin_enddate AND  tb1.visit_id = tb2.visit_id
)
;

CREATE TABLE tb4 AS (
   SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
     , min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
     , max(admin_enddate)   OVER (PARTITION BY visit_id, admin) AS episode_enddate
     FROM tb3
)

This is an uglier version adapted from Erwin's answer for Redshift, which does not support FILTER operation. Tested correctly at least on db fiddle