Via the below code I try to describe my problem. Basically I want to go from slowly changing dimension data:
to some deduplicated data:
Obviously my actual data is bigger (more IDs and Values). In an ideal world I also want to avoid having to use the #Dates helper table if at all possible. My current attempt:
SELECT
ID
, Value1
, Value2
, MIN(#Dates.TheDate) AS StartDate
, MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
ID
, Value1
, Value2
does not produce the wants.
Code for replication:
IF OBJECT_ID(N'tempdb..#Dates') IS NOT NULL DROP TABLE #Dates;
IF OBJECT_ID(N'tempdb..#Haves') IS NOT NULL DROP TABLE #Haves;
IF OBJECT_ID(N'tempdb..#Wants') IS NOT NULL DROP TABLE #Wants;
DECLARE @FromDate DATETIME, @ToDate DATETIME;
SET @FromDate = '2020-01-01';
SET @ToDate = '2020-01-31';
-- all days in that period
SELECT TOP (DATEDIFF(DAY, @FromDate, @ToDate) 1)
TheDate = DATEADD(DAY, number, @FromDate)
INTO #Dates
FROM [master].dbo.spt_values
WHERE [type] = N'P' ORDER BY number;
SELECT * FROM #Dates
SELECT
*
INTO #Haves
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-03' AS EndDate, 1 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-03' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;
SELECT * from #Haves
SELECT
ID
, Value1
, Value2
, MIN(#Dates.TheDate) AS StartDate
, MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
ID
, Value1
, Value2
SELECT
*
INTO #Wants
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;
SELECT * FROM #Wants
CodePudding user response:
This is called a gaps & islands problem. You want to detect groups of consecutive rows. Rows #1 and #2 in your screenshot are considered one group, because row #1's end date equals row #2's start date and id and value1 and value2 are equal.
So, let's detect all group changes first. Then just count the changes up to a row to get group numbers. Then aggregate to get the start and end date per group.
SELECT
id,
MIN(startdate) AS startdate,
MAX(enddate) AS enddate,
MIN(value1) AS value1,
MIN(value2) AS value2
FROM
(
SELECT
id, startdate, enddate, value1, value2,
SUM(chg) OVER (PARTITION BY id ORDER BY startdate) AS grp
FROM
(
SELECT
id, startdate, enddate, value1, value2,
CASE WHEN startdate = LAG(enddate) OVER (PARTITION BY id ORDER BY startdate)
AND value1 = LAG(value1) OVER (PARTITION BY id ORDER BY startdate)
AND value2 = LAG(value2) OVER (PARTITION BY id ORDER BY startdate)
THEN 0
ELSE 1
END AS chg
FROM #Haves
) with_change_flags
) with_groups
GROUP BY id, grp
ORDER BY id, grp;