I need to pull rows of data based on the existence of certain values that exist in a specific sequence.
Here's an example of the data:
Header | EventId | EventDate |
---|---|---|
67891882 | 382 | 2022-01-21 09:29:50.000 |
67891882 | 81 | 2022-01-21 09:03:23.000 |
67891882 | 273 | 2022-01-21 09:03:51.000 |
67891882 | 77 | 2022-01-21 09:05:58.000 |
67891882 | 2 | 2022-01-21 09:29:48.000 |
The results I need are to capture the Header and the EventDate for EventId=81. Further criteria include:
- EventID 81 is the "start" and EventID 77 is the "end"
- Any number of other events can exist between these two with the exception of (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
So in the example above, Eventid 81 with EventDate 2022-01-21 09:03:23.000
would qualify as a row I want to pull as 273 is not in the exception list.
ATTEMPT: I have tried the following query
SELECT *
FROM #Table
WHERE EventDate BETWEEN (SELECT EventDate
FROM #Table
WHERE EventId = 81)
AND (SELECT eventdate
FROM #Table
WHERE EventId = 77)
AND EventId NOT IN (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
ORDER BY 3
But I was immediately confronted with the fact that my sub-queries return more than one result, so this won't work (I was using this to test a singular Header # example, which worked fine). So now I'm not quite sure how to proceed. I'd hate to think that I'd be forced to use a CURSOR
, mostly because my source data is comprised of 266 million rows.
I had also previously tried using the LAG()
function to find my "starting point", but that possibility seemed to dissipate once the request started becoming more and more complex (with the addition of the exclusion list as well as the fact that there could be 1 or 40 rows in between the 81 and 77).
How should I proceed with this? Here's some example data to play with. The Header can be thought of as a parent key, associated with any number of EventID (representing a specific action) and the EventDate with when this occurred:
create table #data (header int, eventid int, eventdate datetime)
insert into #data
values
('62252595', '22', '5/23/2021 12:34:02 PM'),
('62252595', '81', '5/23/2021 12:34:03 PM'),
('62252595', '29', '5/23/2021 12:34:12 PM'),
('62252595', '40', '5/23/2021 12:34:27 PM'),
('62252595', '22', '5/23/2021 12:35:02 PM'),
('62252595', '22', '5/23/2021 12:36:12 PM'),
('62252595', '37', '5/23/2021 12:36:36 PM'),
('62252595', '77', '5/23/2021 12:37:04 PM'),
('62252595', '6', '5/23/2021 12:37:52 PM'),
('63252595', '39', '5/23/2021 12:38:01 PM'),
('63252595', '81', '5/23/2021 12:38:04 PM'),
('63252595', '37', '5/23/2021 12:38:06 PM'),
('63252595', '21', '5/23/2021 12:38:09 PM'),
('63252595', '75', '5/23/2021 12:38:10 PM'),
('63252595', '77', '5/23/2021 12:38:12 PM'),
('64252595', '29', '5/23/2021 12:38:15 PM'),
('64252595', '26', '5/23/2021 12:38:18 PM'),
('64252595', '81', '5/23/2021 12:38:20 PM'),
('64252595', '40', '5/23/2021 12:38:21 PM'),
('64252595', '81', '5/23/2021 12:38:24 PM'),
('64252595', '83', '5/23/2021 12:39:06 PM'),
('64252595', '77', '5/23/2021 12:39:07 PM'),
('65252595', '41', '5/23/2021 12:39:12 PM'),
('65252595', '81', '5/23/2021 12:39:16 PM'),
('65252595', '37', '5/23/2021 12:39:20 PM'),
('65252595', '18', '5/23/2021 12:39:56 PM'),
('65252595', '18', '5/23/2021 12:40:03 PM'),
('65252595', '77', '5/23/2021 12:40:15 PM'),
('65252595', '36', '5/23/2021 12:40:46 PM'),
('65252595', '77', '5/23/2021 12:40:53 PM')
EXPECTED RESULTS:
From this #Data
table, the results I would expect to see would be:
Header | EventId | EventDate |
---|---|---|
62252595 | 81 | 5/23/2021 12:34:03 PM |
65252595 | 81 | 5/23/2021 12:39:16 PM |
Header #'s 63252595 and 64252595 would not qualify because between the first instance of 81 and the first instance of 77 (partition by Header order by EventDate), there exists a 75 at 5/23/2021 12:38:10 PM
and an 83 at 5/23/2021 12:39:06 PM
respectively (both of which in exclusion list). I hope this clears up some confusion.
CodePudding user response:
It would be handy to see the actual expected results for the sample data so I don't actually know if this is correct, it looks like you just need to calculate date ranges per each header:
with h as (
select *,
Min(case when eventid=81 then eventdate end) over(partition by header) Sdate,
Max(case when eventid=77 then eventdate end) over(partition by header) Edate
from #data
)
select header, eventId, EventDate
from h
where eventdate between SDate and EDate
and EventId not in(60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
order by eventdate;
CodePudding user response:
I'm going to assume that a start event (81) always starts a new "frame" from that row onwards, and an end event (77) always starts a new frame from the following row onwards.
I'm also going to assume that you're only interested in frames where both a start and end event are present, and that the frame contains no excepted events (I'll just use 00 as random allowable events and 199 as the only excepted event).
For example...
[81,00,81,00,77,00,81,199,77]
=> frame 1 = [81,00]
=> frame 2 = [81,00,77]
=> frame 3 = [00]
=> frame 4 = [81,199,77]
In that example only the 2nd frame's start event would be returned (the others missing start and/or end events, or containing the excepted event).
WITH
frame_start AS
(
SELECT
*,
CASE
WHEN
81 = eventid
OR
77 = LAG(eventid) OVER (PARTITION BY header ORDER BY eventdate)
THEN
1
ELSE
0
END
AS new_frame
FROM
#data
),
framed AS
(
SELECT
*,
SUM(new_frame) OVER (PARTITION BY header ORDER BY eventdate) AS frame_id
FROM
frame_start
)
SELECT
header, MIN(eventdate)
FROM
framed
GROUP BY
header, frame_id
HAVING
SUM(CASE WHEN eventid IN (81,77) THEN 1 ELSE 0 END) = 2
AND
MAX(CASE WHEN eventid IN (199, etc) THEN 1 ELSE 0 END) = 0
Demo : https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=d54493c87629e3e59759ac9d119ec6ad