Home > Blockchain >  SQL: Pull rows based on sequence of values
SQL: Pull rows based on sequence of values

Time:05-24

I need to pull rows of data based on the existence of certain values that exist in a specific sequence.

Here's an example of the data:

Header EventId EventDate
67891882 382 2022-01-21 09:29:50.000
67891882 81 2022-01-21 09:03:23.000
67891882 273 2022-01-21 09:03:51.000
67891882 77 2022-01-21 09:05:58.000
67891882 2 2022-01-21 09:29:48.000

The results I need are to capture the Header and the EventDate for EventId=81. Further criteria include:

  • EventID 81 is the "start" and EventID 77 is the "end"
  • Any number of other events can exist between these two with the exception of (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)

So in the example above, Eventid 81 with EventDate 2022-01-21 09:03:23.000 would qualify as a row I want to pull as 273 is not in the exception list.

ATTEMPT: I have tried the following query

SELECT *
FROM #Table
WHERE EventDate BETWEEN (SELECT EventDate
                         FROM #Table
                         WHERE EventId = 81)
                    AND (SELECT eventdate
                         FROM #Table
                         WHERE EventId = 77)
    AND EventId NOT IN (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
ORDER BY 3

But I was immediately confronted with the fact that my sub-queries return more than one result, so this won't work (I was using this to test a singular Header # example, which worked fine). So now I'm not quite sure how to proceed. I'd hate to think that I'd be forced to use a CURSOR, mostly because my source data is comprised of 266 million rows.

I had also previously tried using the LAG() function to find my "starting point", but that possibility seemed to dissipate once the request started becoming more and more complex (with the addition of the exclusion list as well as the fact that there could be 1 or 40 rows in between the 81 and 77).


How should I proceed with this? Here's some example data to play with. The Header can be thought of as a parent key, associated with any number of EventID (representing a specific action) and the EventDate with when this occurred:

create table #data (header int, eventid int, eventdate datetime)

insert into #data 
values
('62252595',    '22',   '5/23/2021  12:34:02 PM'),
('62252595',    '81',   '5/23/2021  12:34:03 PM'),
('62252595',    '29',   '5/23/2021  12:34:12 PM'),
('62252595',    '40',   '5/23/2021  12:34:27 PM'),
('62252595',    '22',   '5/23/2021  12:35:02 PM'),
('62252595',    '22',   '5/23/2021  12:36:12 PM'),
('62252595',    '37',   '5/23/2021  12:36:36 PM'),
('62252595',    '77',   '5/23/2021  12:37:04 PM'),
('62252595',    '6',    '5/23/2021  12:37:52 PM'),
('63252595',    '39',   '5/23/2021  12:38:01 PM'),
('63252595',    '81',   '5/23/2021  12:38:04 PM'),
('63252595',    '37',   '5/23/2021  12:38:06 PM'),
('63252595',    '21',   '5/23/2021  12:38:09 PM'),
('63252595',    '75',   '5/23/2021  12:38:10 PM'),
('63252595',    '77',   '5/23/2021  12:38:12 PM'),
('64252595',    '29',   '5/23/2021  12:38:15 PM'),
('64252595',    '26',   '5/23/2021  12:38:18 PM'),
('64252595',    '81',   '5/23/2021  12:38:20 PM'),
('64252595',    '40',   '5/23/2021  12:38:21 PM'),
('64252595',    '81',   '5/23/2021  12:38:24 PM'),
('64252595',    '83',   '5/23/2021  12:39:06 PM'),
('64252595',    '77',   '5/23/2021  12:39:07 PM'),
('65252595',    '41',   '5/23/2021  12:39:12 PM'),
('65252595',    '81',   '5/23/2021  12:39:16 PM'),
('65252595',    '37',   '5/23/2021  12:39:20 PM'),
('65252595',    '18',   '5/23/2021  12:39:56 PM'),
('65252595',    '18',   '5/23/2021  12:40:03 PM'),
('65252595',    '77',   '5/23/2021  12:40:15 PM'),
('65252595',    '36',   '5/23/2021  12:40:46 PM'),
('65252595',    '77',   '5/23/2021  12:40:53 PM')

EXPECTED RESULTS: From this #Data table, the results I would expect to see would be:

Header EventId EventDate
62252595 81 5/23/2021 12:34:03 PM
65252595 81 5/23/2021 12:39:16 PM

Header #'s 63252595 and 64252595 would not qualify because between the first instance of 81 and the first instance of 77 (partition by Header order by EventDate), there exists a 75 at 5/23/2021 12:38:10 PM and an 83 at 5/23/2021 12:39:06 PM respectively (both of which in exclusion list). I hope this clears up some confusion.

CodePudding user response:

It would be handy to see the actual expected results for the sample data so I don't actually know if this is correct, it looks like you just need to calculate date ranges per each header:

with h as (
  select *, 
    Min(case when eventid=81 then eventdate end) over(partition by header) Sdate, 
    Max(case when eventid=77 then eventdate end) over(partition by header) Edate
  from #data
)
select header, eventId, EventDate
from h
where eventdate between SDate and EDate
and EventId not in(60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
order by eventdate;

CodePudding user response:

I'm going to assume that a start event (81) always starts a new "frame" from that row onwards, and an end event (77) always starts a new frame from the following row onwards.

I'm also going to assume that you're only interested in frames where both a start and end event are present, and that the frame contains no excepted events (I'll just use 00 as random allowable events and 199 as the only excepted event).

For example...

[81,00,81,00,77,00,81,199,77]
=> frame 1 = [81,00]
=> frame 2 = [81,00,77]
=> frame 3 = [00]
=> frame 4 = [81,199,77]

In that example only the 2nd frame's start event would be returned (the others missing start and/or end events, or containing the excepted event).

WITH
  frame_start AS
(
  SELECT
    *,
    CASE
      WHEN
        81 = eventid
      OR
        77 = LAG(eventid) OVER (PARTITION BY header ORDER BY eventdate)
      THEN
        1
      ELSE
        0
    END
      AS new_frame
  FROM
    #data
),
  framed AS
(
  SELECT
    *,
    SUM(new_frame) OVER (PARTITION BY header ORDER BY eventdate) AS frame_id
  FROM
    frame_start
)
SELECT
  header, MIN(eventdate)
FROM
  framed
GROUP BY
  header, frame_id
HAVING
  SUM(CASE WHEN eventid IN (81,77) THEN 1 ELSE 0 END) = 2
  AND
  MAX(CASE WHEN eventid IN (199, etc) THEN 1 ELSE 0 END) = 0

Demo : https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=d54493c87629e3e59759ac9d119ec6ad

  • Related