SQL Join after datetime but before next rows start datetime-CodePudding

Given two tables Event and Activity, I want to join all of the activities that happened at or after an Event started but before the next Event started. The difficult part is that Events do not have end datetimes only starting.

CREATE TABLE Event (
  id INT PRIMARY KEY
  ,start_datetime DATETIME
);

CREATE TABLE Activity (
  id INT PRIMARY KEY
  ,log_datetime DATETIME
);

INSERT INTO Event (id,start_datetime)
VALUES (1,'2020-01-01')
      ,(2,'2020-01-01 04:55:15')
      ,(3,'2021-05-01 16:23:45');
      
INSERT INTO Activity (id,log_datetime)
VALUES (1,'2020-01-01 00:00:00')
      ,(2,'2020-01-01 01:20:15')
      ,(3,'2020-01-01 16:23:45')
      ,(4,'2020-02-01 00:00:00')
      ,(5,'2021-05-10 13:00:00')
      ,(6,'2022-02-01 18:30:00');

SELECT
  Event.id AS EventID
  ,Event.start_datetime
  ,Activity.id AS ActivityID
  ,Activity.log_datetime
FROM
  Event
  LEFT JOIN Activity
    ON Activity.log_datetime >= Event.start_datetime
ORDER BY
  Event.start_datetime
  ,Activity.log_datetime;

EventID	start_datetime	ActivityID	log_datetime
1	2020-01-01 00:00:00.000	1	2020-01-01 00:00:00.000
1	2020-01-01 00:00:00.000	2	2020-01-01 01:20:15.000
1	2020-01-01 00:00:00.000	3	2020-01-01 16:23:45.000
1	2020-01-01 00:00:00.000	4	2020-02-01 00:00:00.000
1	2020-01-01 00:00:00.000	5	2021-05-10 13:00:00.000
1	2020-01-01 00:00:00.000	6	2022-02-01 18:30:00.000
2	2020-01-01 04:55:15.000	3	2020-01-01 16:23:45.000
2	2020-01-01 04:55:15.000	4	2020-02-01 00:00:00.000
2	2020-01-01 04:55:15.000	5	2021-05-10 13:00:00.000
2	2020-01-01 04:55:15.000	6	2022-02-01 18:30:00.000
3	2021-05-01 16:23:45.000	5	2021-05-10 13:00:00.000
3	2021-05-01 16:23:45.000	6	2022-02-01 18:30:00.000

The results I am trying to get would be:

EventID	start_datetime	ActivityID	log_datetime
1	2020-01-01 00:00:00.000	1	2020-01-01 00:00:00.000
1	2020-01-01 00:00:00.000	2	2020-01-01 01:20:15.000
2	2020-01-01 04:55:15.000	3	2020-01-01 16:23:45.000
2	2020-01-01 04:55:15.000	4	2020-02-01 00:00:00.000
3	2021-05-01 16:23:45.000	5	2021-05-10 13:00:00.000
3	2021-05-01 16:23:45.000	6	2022-02-01 18:30:00.000

I know I could do this with an additional condition on my join, if Events had an end time.

What I am needing to do I think is first get a query to generate Events with the end_datetime equal to the following Event start_datetime:

EventID	start_datetime	end_datetime
1	2020-01-01 00:00:00.000	2020-01-01 04:55:15.000
2	2020-01-01 04:55:15.000	2021-05-01 16:23:45.000
3	2021-05-01 16:23:45.000	NULL

Which I am able to achieve with:

SELECT
  e.id
  ,e.start_datetime
  ,MIN(e2.start_datetime) AS end_datetime
FROM
  Event e
  LEFT JOIN Event e2
    ON e2.start_datetime>e.start_datetime
GROUP BY e.id, e.start_datetime

So the solution I have is to take the above and form it into a subquery and then join to it the Activity table using the start and now end datetimes:

SELECT
  Events.id AS EventID
  ,Events.start_datetime
  ,Events.end_datetime
  ,Activity.id AS ActivityID
  ,Activity.log_datetime
FROM
  (
    SELECT
      e.id
      ,e.start_datetime
      ,MIN(e2.start_datetime) AS end_datetime
    FROM
      Event e
      LEFT JOIN Event e2
        ON e2.start_datetime>e.start_datetime
    GROUP BY e.id, e.start_datetime
  ) Events
  LEFT JOIN Activity
    ON Activity.log_datetime >= Events.start_datetime
       AND (
         Activity.log_datetime < Events.end_datetime
         OR Events.end_datetime IS NULL
       )
ORDER BY
  Events.start_datetime
  ,Activity.log_datetime

Is there a better way to do this? I got the result wanted, but it doesn't feel very efficient with the subquery.

I know I can and should add indexes on the start_datetime and log_datetime columns to increase optimization. But is there any restructure of the query that would also help optimize?

Fiddle Link:

https://dbfiddle.uk/?rdbms=sqlserver_2016&fiddle=b5639752cc9cba32faee10bf3810d950

EDIT: The IDs of the Events are not always in the same order as the start_datetime. An Event will always last until the next event is started. If there is no next event, then the current event is still in progress. The reason that the IDs are not consistent with the start_datetime order is because this data is ingested and given auto incremented ids and sometimes the ingestion of Events may come out of sequential datetime order.

CodePudding user response：

You can simplify the query a bit. Start with this subquery to retrieve your Event table including an end_datetime. Using the LEAD windowing function.

      SELECT id, start_datetime, 
             LEAD (start_datetime, 1) OVER (ORDER BY start_datetime ASC) 
        FROM Event

Will this make the query faster? Probably: LEAD() is generally more efficient than MIN() ... LEFT JOIN ... GROUP BY because it has to wrangle less data.

Then use it with your (correct) logic to join the Activity table. Using a Common Table Expression, the query now looks like this.

WITH StartEnd AS (
  SELECT 
    id, start_datetime, 
    LEAD (start_datetime, 1) OVER (ORDER BY start_datetime ASC) end_datetime
  FROM Event a
)
SELECT StartEnd.id EventId, StartEnd.start_datetime, StartEnd.end_datetime,
       Activity.id ActivityId, Activity.log_datetime
  FROM StartEnd
  LEFT JOIN Activity 
           ON Activity.log_datetime >= StartEnd.start_datetime
          AND (    Activity.log_datetime < StartEnd.end_datetime
                OR StartEnd.end_datetime IS NULL) 
ORDER BY StartEnd.start_datetime, Activity.log_datetime;

Fiddle

I suspect an index on Event (start_datetime) will help. And so will an index on Activity(log_datetime). You can omit id from those indexes because id is the PK on both tables and is implicitly part of any other index.

But to be sure, when you run this query with your real data in SSMS, right-click in the query window, choose Show Actual Execution Plan, then run the query and examine the plan. Those Actual plans sometimes suggest appropriate indexes.

Common Table Expressions definitely make queries easier to read. And the way the query planner works, they don't harm performance compared to using subqueries.