Home > Enterprise >  deduplication of slowly changing dimension data
deduplication of slowly changing dimension data

Time:12-01

Via the below code I try to describe my problem. Basically I want to go from slowly changing dimension data:

enter image description here

to some deduplicated data:

enter image description here

Obviously my actual data is bigger (more IDs and Values). In an ideal world I also want to avoid having to use the #Dates helper table if at all possible. My current attempt:

SELECT
    ID
    , Value1
    , Value2
    , MIN(#Dates.TheDate) AS StartDate
    , MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
    ID
    , Value1
    , Value2

does not produce the wants.

Code for replication:

IF OBJECT_ID(N'tempdb..#Dates') IS NOT NULL DROP TABLE #Dates;
IF OBJECT_ID(N'tempdb..#Haves') IS NOT NULL DROP TABLE #Haves;
IF OBJECT_ID(N'tempdb..#Wants') IS NOT NULL DROP TABLE #Wants;

DECLARE @FromDate DATETIME, @ToDate DATETIME;
SET @FromDate = '2020-01-01';
SET @ToDate = '2020-01-31';

-- all days in that period
SELECT TOP (DATEDIFF(DAY, @FromDate, @ToDate) 1)
  TheDate = DATEADD(DAY, number, @FromDate)
INTO #Dates
FROM [master].dbo.spt_values
WHERE [type] = N'P' ORDER BY number;

SELECT * FROM #Dates

SELECT
    *
INTO #Haves
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-03' AS EndDate, 1 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-03' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;

SELECT * from #Haves

SELECT
    ID
    , Value1
    , Value2
    , MIN(#Dates.TheDate) AS StartDate
    , MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
    ID
    , Value1
    , Value2

SELECT
    *
INTO #Wants
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;

SELECT * FROM #Wants

CodePudding user response:

This is called a gaps & islands problem. You want to detect groups of consecutive rows. Rows #1 and #2 in your screenshot are considered one group, because row #1's end date equals row #2's start date and id and value1 and value2 are equal.

So, let's detect all group changes first. Then just count the changes up to a row to get group numbers. Then aggregate to get the start and end date per group.

SELECT
  id,
  MIN(startdate) AS startdate,
  MAX(enddate) AS enddate,
  MIN(value1) AS value1,
  MIN(value2) AS value2
FROM
(
  SELECT
    id, startdate, enddate, value1, value2,
    SUM(chg) OVER (PARTITION BY id ORDER BY startdate) AS grp
  FROM
  (
    SELECT
      id, startdate, enddate, value1, value2,
      CASE WHEN startdate = LAG(enddate) OVER (PARTITION BY id ORDER BY startdate)
            AND value1    = LAG(value1)  OVER (PARTITION BY id ORDER BY startdate)
            AND value2    = LAG(value2)  OVER (PARTITION BY id ORDER BY startdate)
        THEN 0
        ELSE 1
      END AS chg
    FROM #Haves
  ) with_change_flags
) with_groups
GROUP BY id, grp
ORDER BY id, grp;
  • Related