Given two tables Event and Activity, I want to join all of the activities that happened at or after an Event started but before the next Event started. The difficult part is that Events do not have end datetimes only starting.
CREATE TABLE Event (
id INT PRIMARY KEY
,start_datetime DATETIME
);
CREATE TABLE Activity (
id INT PRIMARY KEY
,log_datetime DATETIME
);
INSERT INTO Event (id,start_datetime)
VALUES (1,'2020-01-01')
,(2,'2020-01-01 04:55:15')
,(3,'2021-05-01 16:23:45');
INSERT INTO Activity (id,log_datetime)
VALUES (1,'2020-01-01 00:00:00')
,(2,'2020-01-01 01:20:15')
,(3,'2020-01-01 16:23:45')
,(4,'2020-02-01 00:00:00')
,(5,'2021-05-10 13:00:00')
,(6,'2022-02-01 18:30:00');
SELECT
Event.id AS EventID
,Event.start_datetime
,Activity.id AS ActivityID
,Activity.log_datetime
FROM
Event
LEFT JOIN Activity
ON Activity.log_datetime >= Event.start_datetime
ORDER BY
Event.start_datetime
,Activity.log_datetime;
EventID | start_datetime | ActivityID | log_datetime |
---|---|---|---|
1 | 2020-01-01 00:00:00.000 | 1 | 2020-01-01 00:00:00.000 |
1 | 2020-01-01 00:00:00.000 | 2 | 2020-01-01 01:20:15.000 |
1 | 2020-01-01 00:00:00.000 | 3 | 2020-01-01 16:23:45.000 |
1 | 2020-01-01 00:00:00.000 | 4 | 2020-02-01 00:00:00.000 |
1 | 2020-01-01 00:00:00.000 | 5 | 2021-05-10 13:00:00.000 |
1 | 2020-01-01 00:00:00.000 | 6 | 2022-02-01 18:30:00.000 |
2 | 2020-01-01 04:55:15.000 | 3 | 2020-01-01 16:23:45.000 |
2 | 2020-01-01 04:55:15.000 | 4 | 2020-02-01 00:00:00.000 |
2 | 2020-01-01 04:55:15.000 | 5 | 2021-05-10 13:00:00.000 |
2 | 2020-01-01 04:55:15.000 | 6 | 2022-02-01 18:30:00.000 |
3 | 2021-05-01 16:23:45.000 | 5 | 2021-05-10 13:00:00.000 |
3 | 2021-05-01 16:23:45.000 | 6 | 2022-02-01 18:30:00.000 |
The results I am trying to get would be:
EventID | start_datetime | ActivityID | log_datetime |
---|---|---|---|
1 | 2020-01-01 00:00:00.000 | 1 | 2020-01-01 00:00:00.000 |
1 | 2020-01-01 00:00:00.000 | 2 | 2020-01-01 01:20:15.000 |
2 | 2020-01-01 04:55:15.000 | 3 | 2020-01-01 16:23:45.000 |
2 | 2020-01-01 04:55:15.000 | 4 | 2020-02-01 00:00:00.000 |
3 | 2021-05-01 16:23:45.000 | 5 | 2021-05-10 13:00:00.000 |
3 | 2021-05-01 16:23:45.000 | 6 | 2022-02-01 18:30:00.000 |
I know I could do this with an additional condition on my join, if Events had an end time.
What I am needing to do I think is first get a query to generate Events with the end_datetime
equal to the following Event start_datetime
:
EventID | start_datetime | end_datetime |
---|---|---|
1 | 2020-01-01 00:00:00.000 | 2020-01-01 04:55:15.000 |
2 | 2020-01-01 04:55:15.000 | 2021-05-01 16:23:45.000 |
3 | 2021-05-01 16:23:45.000 | NULL |
Which I am able to achieve with:
SELECT
e.id
,e.start_datetime
,MIN(e2.start_datetime) AS end_datetime
FROM
Event e
LEFT JOIN Event e2
ON e2.start_datetime>e.start_datetime
GROUP BY e.id, e.start_datetime
So the solution I have is to take the above and form it into a subquery and then join to it the Activity table using the start and now end datetimes:
SELECT
Events.id AS EventID
,Events.start_datetime
,Events.end_datetime
,Activity.id AS ActivityID
,Activity.log_datetime
FROM
(
SELECT
e.id
,e.start_datetime
,MIN(e2.start_datetime) AS end_datetime
FROM
Event e
LEFT JOIN Event e2
ON e2.start_datetime>e.start_datetime
GROUP BY e.id, e.start_datetime
) Events
LEFT JOIN Activity
ON Activity.log_datetime >= Events.start_datetime
AND (
Activity.log_datetime < Events.end_datetime
OR Events.end_datetime IS NULL
)
ORDER BY
Events.start_datetime
,Activity.log_datetime
Is there a better way to do this? I got the result wanted, but it doesn't feel very efficient with the subquery.
I know I can and should add indexes on the start_datetime
and log_datetime
columns to increase optimization. But is there any restructure of the query that would also help optimize?
Fiddle Link:
https://dbfiddle.uk/?rdbms=sqlserver_2016&fiddle=b5639752cc9cba32faee10bf3810d950
EDIT: The IDs of the Events are not always in the same order as the start_datetime
. An Event will always last until the next event is started. If there is no next event, then the current event is still in progress. The reason that the IDs are not consistent with the start_datetime
order is because this data is ingested and given auto incremented ids and sometimes the ingestion of Events may come out of sequential datetime order.
CodePudding user response:
You can simplify the query a bit. Start with this subquery to retrieve your Event table including an end_datetime. Using the LEAD windowing function.
SELECT id, start_datetime,
LEAD (start_datetime, 1) OVER (ORDER BY start_datetime ASC)
FROM Event
Will this make the query faster? Probably: LEAD()
is generally more efficient than MIN() ... LEFT JOIN ... GROUP BY
because it has to wrangle less data.
Then use it with your (correct) logic to join the Activity table. Using a Common Table Expression, the query now looks like this.
WITH StartEnd AS (
SELECT
id, start_datetime,
LEAD (start_datetime, 1) OVER (ORDER BY start_datetime ASC) end_datetime
FROM Event a
)
SELECT StartEnd.id EventId, StartEnd.start_datetime, StartEnd.end_datetime,
Activity.id ActivityId, Activity.log_datetime
FROM StartEnd
LEFT JOIN Activity
ON Activity.log_datetime >= StartEnd.start_datetime
AND ( Activity.log_datetime < StartEnd.end_datetime
OR StartEnd.end_datetime IS NULL)
ORDER BY StartEnd.start_datetime, Activity.log_datetime;
I suspect an index on Event (start_datetime)
will help. And so will an index on Activity(log_datetime)
. You can omit id
from those indexes because id
is the PK on both tables and is implicitly part of any other index.
But to be sure, when you run this query with your real data in SSMS, right-click in the query window, choose Show Actual Execution Plan, then run the query and examine the plan. Those Actual plans sometimes suggest appropriate indexes.
Common Table Expressions definitely make queries easier to read. And the way the query planner works, they don't harm performance compared to using subqueries.