Stack several rows into one with date condition-CodePudding

I've got raw data from table with information about clients. Information comes from different sources, so it causes duplicates but with different dates:

id  pp  type  start_dt   end_dt
100| 1 | Y | 01.05.19 | 01.10.20
100| 1 | Y | 10.08.20 | 01.10.20
100| 1 | N | 01.10.20 | 02.12.21
100| 1 | N | 13.12.20 | 02.12.21
100| 1 | Y | 02.12.21 | 02.12.26
100| 1 | Y | 20.12.21 | 20.12.26

For example, in this table row 2, 4 and 6 have start date within "start_dt" and "end_dt" of previous row. It's a duplicate, but I need to combine min start date and max end date from both rows for type.

FYI. First two rows and last two rows have same id, pp and type, but I need to stack them separately because of the timeline.

What I want to get (continuous timeline for a client is a key):

id  pp  type  start_dt   end_dt  | cnt
100| 1 | Y | 01.05.19 | 01.10.20 | 2
100| 1 | N | 01.10.20 | 02.12.21 | 2
100| 1 | Y | 02.12.21 | 20.12.26 | 2

I'm using PL/SQL. I think it could be solved by window functions, but I can't figure out which functions to use.

Tried to solve it by group by while having > 1, but in this case it stacks four rows with same type (rows 1,2 and 5,6) into one. I need two separate rows for each type while saving continuous timeline of dates for one client.

CodePudding user response：

From Oracle 12, you can use MATCH_RECOGNIZE for row-by-row pattern matching:

SELECT *
FROM   table_name
MATCH_RECOGNIZE(
  PARTITION BY id, pp
  ORDER     BY start_dt
  MEASURES
    FIRST(type)     AS type,
    FIRST(start_dt) AS start_dt,
    MAX(end_dt)     AS end_dt,
    COUNT(*)        AS cnt
  PATTERN (overlapping* last_row)
  DEFINE
    overlapping AS type        = NEXT(type)
               AND MAX(end_dt) >= NEXT(start_dt)
)

Which, for the sample data:

CREATE TABLE table_name (id, pp, type, start_dt, end_dt) AS
SELECT 100, 1, 'Y', DATE '2019-05-01', DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 100, 1, 'Y', DATE '2020-08-10', DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 100, 1, 'N', DATE '2020-10-01', DATE '2021-12-02' FROM DUAL UNION ALL
SELECT 100, 1, 'N', DATE '2020-12-13', DATE '2021-12-02' FROM DUAL UNION ALL
SELECT 100, 1, 'Y', DATE '2021-12-02', DATE '2026-12-02' FROM DUAL UNION ALL
SELECT 100, 1, 'Y', DATE '2021-12-20', DATE '2026-12-20' FROM DUAL;

Outputs:

ID	PP	TYPE	START_DT	END_DT	CNT
100	1	Y	2019-05-01 00:00:00	2020-10-01 00:00:00	2
100	1	N	2020-10-01 00:00:00	2021-12-02 00:00:00	2
100	1	Y	2021-12-02 00:00:00	2026-12-20 00:00:00	2

fiddle

If you want to use analytic and aggregation functions then it is a bit more complicated:

SELECT id, pp, type,
       MIN(start_dt) AS start_dt,
       MAX(end_dt) AS end_dt,
       COUNT(*) AS cnt
FROM   (
  SELECT id, pp, type, start_dt, end_dt,
         SUM(grp_change) OVER (
                            PARTITION BY id, pp, type
                            ORDER BY start_dt
                          ) AS grp
  FROM   (
    SELECT t.*,
           CASE 
           WHEN start_dt <= MAX(end_dt) OVER (
                              PARTITION BY id, pp, type
                              ORDER BY start_dt
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                            )
           THEN 0
           ELSE 1
           END AS grp_change
    FROM   table_name t
  )
)
GROUP BY id, pp, type
ORDER BY id, pp, start_dt

(Untested as db<>fiddle seems to be having issues at the moment.)

CodePudding user response：

I prefer this version because comparing "type = next(type)" without "type" being in the "order by" may lead to errors.

match_recognize(
    partition by id, pp, type
    order by start_dt,end_dt
    measures first(start_dt) as start_dt, max(end_dt) as end_dt, count(*) as n
    pattern (merged* strt)
    define
        merged as  max(end_dt) >= next(start_dt)
)