I am trying to group some data by dates that do not have gaps
I am also actually trying to build this as a view.
identifier start_date end_date
4EF 2021-07-29 2021-08-05
4EF 2021-08-06 2021-08-09
4EF 2021-10-07 2021-10-12
4EF 2021-10-13 2021-11-12
2AB 2021-01-01 2021-06-20
2AB 2021-06-22 2021-12-01
The desired result of the above would be:
identifier start_date end_date
4EF 2021-07-29 2021-08-09
4EF 2021-10-07 2021-11-12
2AB 2021-01-01 2021-06-20
2AB 2021-06-22 2021-12-01
What would be the most efficient way to do this?
CodePudding user response:
This is fairly tricky problem, because it is a gaps and islands problem, but your initial data set is far removed from having clear values for the gaps or islands. Here is one approach:
WITH cte AS (
SELECT *, CASE WHEN start_date - LAG(end_date) OVER (PARTITION BY identifier
ORDER BY start_date) = 1
THEN 1 ELSE 0 END AS label
FROM yourTable
),
cte2 AS (
SELECT *, SUM(CASE WHEN label = 0 THEN 1 ELSE 0 END)
OVER (PARTITION BY identifier ORDER BY start_date) AS grp
FROM cte
)
SELECT
identifier,
MIN(start_date) AS start_date,
MAX(end_date) AS end_date
FROM cte2
GROUP BY
identifier,
grp
ORDER BY
identifier,
MIN(start_date);
Demo
The logic of the first CTE is to assign a 0 or 1 depending on whether the previous record for a given identifier were part of the same continuous date range or not. Once we have these 0s and 1s assigned, then we can take a rolling sum within each identifier to come up with a group number. The point here is that we generate a pseudo group for each identifier and continuous dates.