I have an issue with removing or tagging overlapping timestamps grouped by certain ID.
Times can overlap in nest and may have same start time or end time.
If second time starts before previous time has ended it will end before or at the same time as previous time. No time differences will go over 12 hours.
Using T-SQL.
Sample data:
ID task_id starttime endtime
11 1 2023-01-10 06:31:00.000 2023-01-10 08:53:00.000
11 1 2023-01-10 08:00:00.000 2023-01-10 08:53:00.000
11 2 2023-01-10 13:14:00.000 2023-01-10 15:15:00.000
11 2 2023-01-10 15:46:00.000 2023-01-10 17:59:00.000
11 2 2023-01-10 18:49:00.000 2023-01-10 18:50:00.000
12 3 2023-01-09 10:10:00.000 2023-01-09 11:10:00.000
12 3 2023-01-09 10:10:00.000 2023-01-09 10:50:00.000
13 4 2023-01-08 20:00:00.000 2023-01-09 03:44:00.000
13 4 2023-01-08 21:00:00.000 2023-01-09 02:00:00.000
14 5 2023-01-01 19:23:00.000 2023-01-01 20:47:00.000
14 5 2023-01-02 03:35:00.000 2023-01-02 06:57:00.000
Desired result:
ID task_id starttime endtime
11 1 2023-01-10 06:31:00.000 2023-01-10 08:53:00.000
11 2 2023-01-10 13:14:00.000 2023-01-10 15:15:00.000
11 2 2023-01-10 15:46:00.000 2023-01-10 17:59:00.000
11 2 2023-01-10 18:49:00.000 2023-01-10 18:50:00.000
12 3 2023-01-09 10:10:00.000 2023-01-09 11:10:00.000
13 4 2023-01-08 20:00:00.000 2023-01-09 03:44:00.000
14 5 2023-01-01 19:23:00.000 2023-01-01 20:47:00.000
14 5 2023-01-02 03:35:00.000 2023-01-02 06:57:00.000
I've tried methods with lead or lag functions but it doesn't seem to play well with edge cases. For example:
case when lead(starttime) over (partition by task_id order by starttime) <> endtime then 1 else 0 end as overlap_tag
Doesn't count the time in ID 11 task_id 2 from 18:49-18:50 as not overlapping and doesn't seem to take into account the day changing.
CodePudding user response:
I only tested it on PostgreSQL, but it might help.
Conditions
- There is no previous task.
- Current task doesn't overlap with previous task.
Reference: Determine Whether Two Date Ranges Overlap
Preparation
CREATE TABLE task_duration (
id INTEGER,
task_id INTEGER,
start_time TIMESTAMP,
end_time TIMESTAMP
);
INSERT INTO task_duration VALUES (11, 1, '2023-01-10 06:31:00.000', '2023-01-10 08:53:00.000');
INSERT INTO task_duration VALUES (11, 1, '2023-01-10 08:00:00.000', '2023-01-10 08:53:00.000');
INSERT INTO task_duration VALUES (11, 2, '2023-01-10 13:14:00.000', '2023-01-10 15:15:00.000');
INSERT INTO task_duration VALUES (11, 2, '2023-01-10 15:46:00.000', '2023-01-10 17:59:00.000');
INSERT INTO task_duration VALUES (11, 2, '2023-01-10 18:49:00.000', '2023-01-10 18:50:00.000');
INSERT INTO task_duration VALUES (12, 3, '2023-01-09 10:10:00.000', '2023-01-09 11:10:00.000');
INSERT INTO task_duration VALUES (12, 3, '2023-01-09 10:10:00.000', '2023-01-09 10:50:00.000');
INSERT INTO task_duration VALUES (13, 4, '2023-01-08 20:00:00.000', '2023-01-09 03:44:00.000');
INSERT INTO task_duration VALUES (13, 4, '2023-01-08 21:00:00.000', '2023-01-09 02:00:00.000');
INSERT INTO task_duration VALUES (14, 5, '2023-01-01 19:23:00.000', '2023-01-01 20:47:00.000');
INSERT INTO task_duration VALUES (14, 5, '2023-01-02 03:35:00.000', '2023-01-02 06:57:00.000');
Query
SELECT id,
task_id,
start_time,
end_time
FROM (
SELECT id,
task_id,
start_time,
end_time,
LAG(start_time) OVER (PARTITION BY task_id ORDER BY task_id, start_time, end_time DESC) AS prev_start_time,
LAG(end_time) OVER (PARTITION BY task_id ORDER BY task_id, start_time, end_time DESC) AS prev_end_time
FROM task_duration
) v
WHERE prev_start_time IS NULL -- 1st condition
OR NOT (v.end_time >= v.prev_start_time AND v.start_time <= v.prev_end_time); -- 2nd condition
Result
id|task_id|start_time |end_time |
-- ------- ----------------------- -----------------------
11| 1|2023-01-10 06:31:00.000|2023-01-10 08:53:00.000|
11| 2|2023-01-10 13:14:00.000|2023-01-10 15:15:00.000|
11| 2|2023-01-10 15:46:00.000|2023-01-10 17:59:00.000|
11| 2|2023-01-10 18:49:00.000|2023-01-10 18:50:00.000|
12| 3|2023-01-09 10:10:00.000|2023-01-09 11:10:00.000|
13| 4|2023-01-08 20:00:00.000|2023-01-09 03:44:00.000|
14| 5|2023-01-01 19:23:00.000|2023-01-01 20:47:00.000|
14| 5|2023-01-02 03:35:00.000|2023-01-02 06:57:00.000|
CodePudding user response:
Try this https://dbfiddle.uk/id_waBN_
with task_duration_wrn(id, task_id, starttime, endtime, rn, act) as (
select id, task_id, starttime, endtime,
rank() over(partition by id, task_id order by starttime, endtime) as rn,
cast(
case when
starttime <= lag(endtime) over(partition by id, task_id order by starttime, endtime)
then 'PACK' end as VARCHAR(4))
from task_duration
),
cte(id, task_id, starttime, endtime, rn, lvl, act) as (
select d.id, d.task_id, d.starttime, d.endtime, d.rn, 1,
CAST(NULL AS VARCHAR(4))
from task_duration_wrn d
where act is NULL
union all
select d.id, d.task_id, c.starttime, d.endtime, d.rn, c.lvl 1, d.act
from cte c
join task_duration_wrn d on c.id = d.id and c.task_id = d.task_id and
c.lvl 1 = d.rn
where d.act = 'PACK'
)
select id, task_id, starttime, max(endtime) as endtime
from cte c
group by id, task_id, starttime
order by id, task_id, starttime