No doubt a similar question has come up before, but I haven't been able to locate it by searching...
I have a raw dataset with time series data including 'from' and 'to' date fields.
The problem is, when data is loaded, new records have been created ('to' date added to old record, new record 'from' load date) even where no values have changed. I want to convert this to a table which just shows a row for each genuine change - and the from/ to dates reflecting this.
By way of example, the source data looks like this:
ID | Col1 | Col2 | Col3 | From | To |
---|---|---|---|---|---|
Test1 | 1 | 1 | 1 | 01/01/2020 | 31/12/9999 |
Test2 | 1 | 2 | 3 | 01/01/2020 | 30/06/2020 |
Test2 | 1 | 2 | 3 | 01/07/2020 | 30/09/2020 |
Test2 | 3 | 2 | 1 | 01/10/2020 | 31/12/9999 |
The first two records for Test2 (rows 2 and 3) are essentially the same - there was no change when the second row was loaded on 01/07/2020. I want a single row for the period 01/01/2020 - 30/09/2020 for which there was no change:
ID | Col1 | Col2 | Col3 | From | To |
---|---|---|---|---|---|
Test1 | 1 | 1 | 1 | 01/01/2020 | 31/12/9999 |
Test2 | 1 | 2 | 3 | 01/01/2020 | 30/09/2020 |
Test2 | 3 | 2 | 1 | 01/10/2020 | 31/12/9999 |
For this simplified example, I can achieve that by grouping by each column (apart from dates) and using the MIN from date/ MAX end date:
SELECT
ID, Col1, Col2, Col3, MIN(From) AS From, MAX(To) as TO
FROM TABLE
GROUP BY ID, Col1, Col2, Col3
However, this won't work if a value changes then subsequently changes back to what it was before eg
ID | Col1 | Col2 | Col3 | From | To |
---|---|---|---|---|---|
Test1 | 1 | 1 | 1 | 01/01/2020 | 31/12/9999 |
Test2 | 1 | 2 | 3 | 01/01/2020 | 30/04/2020 |
Test2 | 1 | 2 | 3 | 01/05/2020 | 30/06/2020 |
Test2 | 3 | 2 | 1 | 01/07/2020 | 30/10/2020 |
Test2 | 1 | 2 | 3 | 01/11/2020 | 31/12/9999 |
Simply using MIN/ MAX in the code above would return this - so it looks like both sets of values were valid for the period from 01/07/2020 - 30/10/2020:
ID | Col1 | Col2 | Col3 | From | To |
---|---|---|---|---|---|
Test1 | 1 | 1 | 1 | 01/01/2020 | 31/12/9999 |
Test2 | 1 | 2 | 3 | 01/01/2020 | 31/12/9999 |
Test2 | 3 | 2 | 1 | 01/07/2020 | 30/10/2020 |
Whereas actually the first set of values were valid before and after that period, but not during. It should return a single row for instead of two for the period from 01/01/2020 - 30/06/2020 when there were no changes for this ID, but then another row for the period when the values were different, and then another row where it reverted to the initial values, but with a new From date.
ID | Col1 | Col2 | Col3 | From | To |
---|---|---|---|---|---|
Test1 | 1 | 1 | 1 | 01/01/2020 | 31/12/9999 |
Test2 | 1 | 2 | 3 | 01/01/2020 | 30/06/2020 |
Test2 | 3 | 2 | 1 | 01/07/2020 | 30/10/2020 |
Test2 | 1 | 2 | 3 | 01/11/2020 | 31/12/9999 |
I'm struggling to conceptualise how to approach this. I'm guessing I need to use LAG somehow but not sure how to apply it - eg rank everything in a staging table first, then use LAG to compare a concatenation of the whole row?
I'm sure I could find a fudged way eventually, but I've no doubt this problem has been solved many times before so hoping somebody can point me to a simpler/ neater solution than I'd inevitably come up with...
CodePudding user response:
Advanced Gaps and Islands
I believe this is an advanced "gaps and islands" problem. Use that as a search term and you'll find plenty of literature on the subject. Only difference is normally only one column is being tracked, but you have 3.
No Gaps Assumption
One major assumption of this script is there is no gap in the overlapping dates, or in other words, it assumes the previous rows ToDate = current FromDate - 1 day. Not sure if you need to account for gaps, would be simple just add criteria to IsChanged to check for that
Multi-Column Gaps and Islands Solution
DROP TABLE IF EXISTS #Grouping
DROP TABLE IF EXISTS #Test
CREATE TABLE #Test (ID INT IDENTITY(1,1),TestName Varchar(10),Col1 INT,Col2 INT,Col3 INT,FromDate Date,ToDate DATE)
INSERT INTO #Test VALUES
('Test1',1,1,1,'2020-01-01','9999-12-31')
,('Test2',1,2,3,'2020-01-01','2020-04-30')
,('Test2',1,2,3,'2020-05-01','2020-06-30')
,('Test2',3,2,1,'2020-07-01','2020-10-30')
,('Test2',1,2,3,'2020-11-01','9999-12-31')
;WITH cte_Prev AS (
SELECT *
,PrevCol1 = LAG(Col1) OVER (PARTITION BY TestName ORDER BY FromDate)
,PrevCol2 = LAG(Col2) OVER (PARTITION BY TestName ORDER BY FromDate)
,PrevCol3 = LAG(Col3) OVER (PARTITION BY TestName ORDER BY FromDate)
FROM #Test
), cte_Compare AS (
SELECT *
,IsChanged = CASE
WHEN Col1 = PrevCol1
AND Col2 = PrevCol2
AND Col3 = PrevCol3
THEN 0 /*No change*/
ELSE 1 /*Iterate so new group created */
END
FROM cte_Prev
)
SELECT *,GroupID = SUM(IsChanged) OVER (PARTITION BY TestName ORDER BY ID)
INTO #Grouping
FROM cte_Compare
/*Raw unformatted data so you can see how it works*/
SELECT *
FROM #Grouping
/*Aggregated results*/
SELECT GroupID,TestName,Col1,Col2,Col3
,FromDate = MIN(FromDate)
,ToDate = MAX(ToDate)
,NumberOfRowsCollapsedIntoOneRow = COUNT(*)
FROM #Grouping
GROUP BY GroupID,TestName,Col1,Col2,Col3