Home > Enterprise >  TSQL - creating from-to date table while ignoring in-between steps with conditions
TSQL - creating from-to date table while ignoring in-between steps with conditions

Time:07-29

I have a following table structure in the MS SQL server

table 1

Now, I want to get [in] dates for each [id] according to these rules:

  1. If there is [out] at 23:59 and [in] at 00:00 on the next day ignore these
  2. If there is [out] and [in] on the same day ignore these

e.g. I should be getting following results:

id in out
12 2020-07-30 2020-09-06
13 2020-07-31 2021-08-23
14 2020-07-31 NULL

...

id in out
23 2020-08-10 2020-09-21
23 2021-03-31 2021-04-24
24 2020-08-10 NULL

...

I do not care about the [out]s, so they are not mandatory. Is there a simple way to achieve this?

I came with something like this:

select *from #temp1
EXCEPT
select t1.* from #temp1 t1
INNER join #temp1 t2
ON t1.id =t2.id
and (t1.[in] = DATEADD(minute,1,t2.[out]) or DAY(t1.[in]) = DAY(t2.[out]))

But I am losing some data, e.g. the second [in] for Id 23 from the example.

CodePudding user response:

I am confident there is a better solution than this. I wrestled with a enter image description here

Here is the dbfiddle.

Certainly my approach could be made more concise (i.e. less sub-queries), but I left them in there so they can be run from the inside out and you can see the progression of the logic. I think I could have avoided using the ROW_NUMBER() function if I could have gotten the LAST_VALUE() function to work. Either I do not know how to use it or it is broken, likely the former.

Finally, I would recommend to not use keywords such as "in" for column names. I just makes things simpler.

Noel

CodePudding user response:

Leveraging a recursive CTE (Common Table Expression) yields a cleaner approach imho. I'm not sure how it compares to Isaac's approach performance wise so you have to test that yourself.

Setup:

DROP TABLE IF EXISTS MyTable;
CREATE TABLE MyTable(
   id  INTEGER  NOT NULL
  ,[in]  VARCHAR(19) NOT NULL
  ,[out] VARCHAR(19)
);
INSERT INTO MyTable([id],[in],[out]) VALUES (12,'2020-07-30 00:00:00','2020-09-06 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (13,'2020-07-31 00:00:00','2020-11-02 13:00:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (13,'2020-11-02 13:30:00','2021-08-23 09:00:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (14,'2020-07-31 00:00:00','2021-11-04 11:00:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (14,'2021-11-04 11:05:00',NULL);
INSERT INTO MyTable([id],[in],[out]) VALUES (17,'2020-08-04 00:00:00','2021-11-03 11:00:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (17,'2021-11-03 11:05:00',NULL);
INSERT INTO MyTable([id],[in],[out]) VALUES (18,'2020-08-04 00:00:00','2021-10-22 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (19,'2020-08-05 00:00:00',NULL);
INSERT INTO MyTable([id],[in],[out]) VALUES (20,'2020-08-05 00:00:00','2021-04-10 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (21,'2020-08-05 14:00:00','2021-10-28 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (22,'2020-08-10 00:00:00',NULL);
INSERT INTO MyTable([id],[in],[out]) VALUES (23,'2020-08-10 00:00:00','2020-08-31 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (23,'2020-09-01 00:00:00','2020-09-14 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (23,'2020-09-15 00:00:00','2020-09-21 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (23,'2021-03-31 00:00:00','2021-04-01 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (23,'2021-04-02 00:00:00','2021-04-24 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (24,'2020-08-10 00:00:00','2020-08-16 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (24,'2020-08-17 00:00:00','2020-08-26 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (24,'2020-08-27 00:00:00','2020-09-12 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (24,'2020-09-13 00:00:00','2020-09-19 23:59:00');
INSERT INTO MyTable([id],[in],[out]) VALUES (24,'2020-09-20 00:00:00',NULL);

Solution:

WITH base AS 
(
    SELECT id, [in], out, ROW_NUMBER() OVER (ORDER BY id, [in]) AS seq
    FROM (SELECT *, 
                 LAG(out) OVER (PARTITION BY id ORDER BY [in]) AS prev_out 
          FROM MyTable) T
    WHERE prev_out IS NULL 
       OR (DATEADD(MINUTE, -1, [in]) != prev_out AND CAST([in] AS DATE) != CAST(prev_out AS DATE))
),

walker AS (
    SELECT id, [in], out, seq FROM base
    
    UNION ALL
    
    SELECT walker.id, MyTable.[in], MyTable.out, walker.seq
    FROM MyTable
    INNER JOIN walker 
        ON MyTable.id = walker.id
        AND (DATEADD(MINUTE, 1, MyTable.[in]) = walker.out
        OR CAST(MyTable.[in] AS DATE) = CAST(walker.out AS DATE))
)

SELECT id,
       MIN(CAST([in] AS DATE)) AS [in],
       MAX(CAST(out AS DATE))  AS out
FROM walker
GROUP BY id, seq
ORDER BY id, seq, [in];

Output: | id | in | out | | :--- | :--- | :--- | | 12 | 2020-07-30 | 2020-09-06 | | 13 | 2020-07-31 | 2021-08-23 | | 14 | 2020-07-31 | 2021-11-04 | | 17 | 2020-08-04 | 2021-11-03 | | 18 | 2020-08-04 | 2021-10-22 | | 19 | 2020-08-05 | null | | 20 | 2020-08-05 | 2021-04-10 | | 21 | 2020-08-05 | 2021-10-28 | | 22 | 2020-08-10 | null | | 23 | 2020-08-10 | 2020-08-31 | | 23 | 2021-03-31 | 2021-04-01 | | 24 | 2020-08-10 | 2020-08-16 |

See SQL Fiddle for sample data (thanks to Isaac)

Explanation:

  • In the first CTE (named 'base') we select the rows that form the start of a (potential) sequence. These rows don't have a proceeding row based on the rules specified in the OP. Each sequence gets a unique number assigned (ROW_NUMBER).
  • In the second CTE (named 'walker') we recursively walk the sequence from start to end, assigning the sequence number to each row and preserving the last 'out' value.
  • Finally, we select the MIN(in) and MAX(out) grouped by sequence number to get the desired results.
  • Related