SQL - gap and island issue for more attributes-CodePudding

I have following table, which apart from other attributes contains:

ID - unique identifier
Column1
Column2
Column3
CreatedDate - when the record has been created (based on ETL)
UpdatedDate - until when the record has been valid

Since there are other attributes apart from the 3 columns, which are being tracked for historical values, there might be cases, where there are multiple rows with the same values for all three columns for the same ID, but different timestamps in [CreatedDate] / [UpdatedDate]. Thus, the data may look like:

ID	Column1	Column2	Column3	CreatedDate	UpdatedDate
1122	T1	In Progress	NULL	02/02/2022 18:39:38	29/03/2022 14:25:24
1122	T1	In Progress	NULL	05/01/2022 10:45:50	02/02/2022 18:39:38
1122	T1	In Progress	NULL	03/01/2022 12:11:47	05/01/2022 10:45:50
1122	T1	In Progress	Yes	13/12/2021 21:43:44	03/01/2022 12:11:47
1122	T1	In Progress	NULL	17/02/2021 14:12:15	13/12/2021 21:43:44
1122	T1	In Progress	NULL	22/12/2020 14:38:32	17/02/2021 14:12:15
1122	T1	In Progress	NULL	17/12/2020 18:38:38	22/12/2020 14:38:32
1122	T3	Ready	NULL	30/03/2020 14:35:18	17/12/2020 18:38:38
1122	NULL	Ready	NULL	04/09/2019 18:33:24	30/03/2020 14:35:18
1122	T2	Ready	NULL	07/01/2019 11:07:39	04/09/2019 18:33:24
1122	T2	Ready	NULL	17/09/2018 14:31:17	07/01/2019 11:07:39
1122	T0	Ready	NULL	28/08/2018 14:31:39	17/09/2018 14:31:17
1122	T0	Ready	NULL	13/02/2018 14:48:44	28/08/2018 14:31:39

I would like to keep the unique values for all 3 columns in correct sequence, hence the ideal output should look like:

ID	Column1	Column2	Column3	CreatedDate	UpdatedDate
1122	T1	In Progress	NULL	03/01/2022 12:11:47	29/03/2022 14:25:24
1122	T1	In Progress	Yes	13/12/2021 21:43:44	03/01/2022 12:11:47
1122	T1	In Progress	NULL	17/12/2020 18:38:38	13/12/2021 21:43:44
1122	T3	Ready	NULL	30/03/2020 14:35:18	17/12/2020 18:38:38
1122	NULL	Ready	NULL	04/09/2019 18:33:24	30/03/2020 14:35:18
1122	T2	Ready	NULL	17/09/2018 14:31:17	04/09/2019 18:33:24
1122	T0	Ready	NULL	13/02/2018 14:48:44	17/09/2018 14:31:17

The below code works fine, if there is just one column, but it does not work for multiple columns as it returns all unique rows.

select ID, Column1, Column2, Column3,  min(createddate), max(updateddate)
from (select t.*,
             sum(case when prev_updatedate >= createddate then 0 else 1 end) over (partition by ID order by createddate) as grp
      from (select h.*,
                   max(updateddate) over (partition by ID order by createddate rows between unbounded preceding and 1 preceding) as prev_updatedate
            from #history h
           ) h
     ) h
group by ID, Column1, Column2, Column3, grp;

Any ideas how to fix it, please?

CodePudding user response：

You can try to use ROW_NUMBER window function to make the gap with your logic then you might get a grouping for gaps-and-islands

SELECT ID,Column1,Column2,Column3, min(createddate) CreatedDate, max(updateddate) UpdatedDate
FROM ( 
  select *,
      ROW_NUMBER() over (partition by ID order by createddate) - 
      ROW_NUMBER() over (partition by ID,Column1,Column2,Column3 order by createddate) grp
  from history
) t1
GROUP BY grp,ID,Column1,Column2,Column3
ORDER BY CreatedDate DESC

sqlfiddle