Get data from first and last row of groups based on sequence of rows-CodePudding

Taking it a step further than in my previous question: I have a table below with sample data.

datetimestamp	message	server	system
2022-07-13 08:59:09	Normal	Server 1	System 1
2022-07-13 08:59:10	Normal	Server 4	System 2
2022-07-13 08:59:11	Normal	Server 3	System 3
2022-07-13 08:59:12	Warning	Server 35	System 27
2022-07-13 08:59:13	Warning	Server 5	System 5
2022-07-13 08:59:14	Warning	Server 9	System 6
2022-07-13 08:59:15	Warning	Server 8	System 7
2022-07-13 08:59:16	Error	Server 12	System 8
2022-07-13 08:59:17	Error	Server 15	System 9
2022-07-13 08:59:18	Warning	Server 29	System 10
2022-07-13 08:59:19	Warning	Server 22	System 11
2022-07-13 08:59:20	Warning	Server 13	System 12
2022-07-13 08:59:21	Normal	Server 16	System 13
2022-07-13 08:59:22	Normal	Server 19	System 14
2022-07-13 08:59:23	Normal	Server 21	System 15
2022-07-13 08:59:24	Warning	Server 24	System 16
2022-07-13 08:59:25	Warning	Server 27	System 17
2022-07-13 08:59:26	Warning	Server 25	System 18
2022-07-13 08:59:27	Error	Server 30	System 23
2022-07-13 08:59:28	Error	Server 31	System 20

I need to write a Postgres query which will give me output like:

Start_Datetime	End_Datetime	Server_Start	Server_End	System_Start	System_End
2022-07-13 08:59:12	2022-07-13 08:59:15	Server 35	Server 8	System 27	System 7
2022-07-13 08:59:24	2022-07-13 08:59:26	Server 24	Server 25	System 16	System 18

If there is any 'Error' message, I need to take the start time of 'Warning' and end time of 'Warning' - and corresponding server and system! If there is no 'Error' message after 'Warning' ignore it e.g. after the warning at 2022-07-13 08:59:20 there is no error, so the query should ignore that range.

Setup table queries:

CREATE TABLE test_data (
  id integer PRIMARY KEY
, message varchar(10)
, datetimestamp timestamp NOT NULL
, server varchar(10)
, system varchar(10)
);

INSERT INTO test_data VALUES
  (09, 'Normal' , '2022-07-13 08:59:09', 'Server 1' , 'System 1')
, (10, 'Normal' , '2022-07-13 08:59:10', 'Server 4' , 'System 2')
, (11, 'Normal' , '2022-07-13 08:59:11', 'Server 3' , 'System 3')
, (12, 'Warning', '2022-07-13 08:59:12', 'Server 35', 'System 27')
, (13, 'Warning', '2022-07-13 08:59:13', 'Server 5' , 'System 5')
, (14, 'Warning', '2022-07-13 08:59:14', 'Server 9' , 'System 6')
, (15, 'Warning', '2022-07-13 08:59:15', 'Server 8' , 'System 7')
, (16, 'Error'  , '2022-07-13 08:59:16', 'Server 12', 'System 8')
, (17, 'Error'  , '2022-07-13 08:59:17', 'Server 15', 'System 9')
, (18, 'Warning', '2022-07-13 08:59:18', 'Server 29', 'System 10')
, (19, 'Warning', '2022-07-13 08:59:19', 'Server 22', 'System 11')
, (20, 'Warning', '2022-07-13 08:59:20', 'Server 13', 'System 12')
, (21, 'Normal' , '2022-07-13 08:59:21', 'Server 16', 'System 13')
, (22, 'Normal' , '2022-07-13 08:59:22', 'Server 19', 'System 14')
, (23, 'Normal' , '2022-07-13 08:59:23', 'Server 21', 'System 15')
, (24, 'Warning', '2022-07-13 08:59:24', 'Server 24', 'System 16')
, (25, 'Warning', '2022-07-13 08:59:25', 'Server 27', 'System 17')
, (26, 'Warning', '2022-07-13 08:59:26', 'Server 25', 'System 18')
, (27, 'Error'  , '2022-07-13 08:59:27', 'Server 30', 'System 23')
, (28, 'Error'  , '2022-07-13 08:59:28', 'Server 31', 'System 20')
;

CodePudding user response：

This is a kind of gaps and islands problem. First detect groups of messages and then find lead / lag group kind. Take the combinations needed.

with mg as(
  select message, min(id) id1, max(id) id2
    , lag(message, 1, 'Normal') over(order by min(id)) prevmsg
    , lead(message,1, 'Error') over(order by min(id)) nxtmsg
  from (
    select *, row_number() over(order by datetimestamp) - row_number() over(partition by message order by datetimestamp) grp
    from test_data
  ) t
  group by message, grp
)
select mg.message, mg.id1, mg.id2
from mg
where mg.message ='Warning' and mg.prevmsg='Normal' and mg.nxtmsg='Error'

With your test data it will return

message id1 id2
Warning 12  15
Warning 24  26

Join it to the original data twice to select columns you want for id1 and id2.

P.S. Correct you sample data INSERT, single quotes are missing.

CodePudding user response：

You may try the following:

With cte As 
(
  Select id, message,datetimestamp, 
  Coalesce(Lead(message) Over (Order By datetimestamp,id) , 'last') as ld,
  Coalesce(lag(message) Over (Order By datetimestamp,id) , 'first') as lg
  From test_data
),
cte2 As
(
  Select id,message,ld,lg,
  Coalesce(Lag(id) over (order by datetimestamp,id),id) lastWarningId,
  Coalesce(Lag(id, 2) over (order by datetimestamp,id),
           Lag(id) over (order by datetimestamp,id)) firstWarningId
  From cte where message<> ld or message <> lg
)
Select T.datetimestamp as Start_DateTime, T2.datetimestamp as End_DateTime, 
       T.server as Server_Start, T2.server as Server_End,
       T.system as System_Start, T2.system as System_End
  From cte2 C
  Join test_data T On T.id = C.firstwarningid
  Join test_data T2 On T2.id = C.lastwarningid
  Where C.message='Error' And C.lg='Warning'

See a demo from uk<>fiddle.

The idea of this query is to find the first and last id for each group of messages, which is done in cet2. Where:

Lag(id) over (order by datetimestamp,id) is the lastid of each group.

Lag(id, 2) over (order by datetimestamp,id) is the first id of each group.

Then for each group of message='Error' find the first and last id of the nearest group of message='Warning', which done by Where C.message='Error' And C.lg='Warning'.

To get more details about how it works, try to select * from cet2 Where message='Error' And lg='Warning'.

CodePudding user response：

Elaborating on my answer to your previous question, this does the job:

WITH cte AS (
   SELECT grp, datetimestamp, message, server, system
        , bool_or(message = 'Error' AND last_msg = 'Warning') OVER (PARTITION BY grp) AS report  -- qualifies for report
   FROM  (
      SELECT *
           , count(*) FILTER (WHERE message = 'Warning' AND last_msg <> 'Warning') OVER (ORDER BY datetimestamp) AS grp
      FROM  (
         SELECT datetimestamp, message, server, system
              , lag(message) OVER (ORDER BY datetimestamp) AS last_msg
         FROM   test_data
         ) sub1
      WHERE  message IN ('Warning', 'Error')  -- trim noise early
      ) sub2
   )
SELECT grp
     , a.datetimestamp AS start_time, z.datetimestamp AS end_time
     , a.server AS start_server, z.server AS end_server
     , a.system AS start_system, z.system AS end_system
FROM  (
   SELECT DISTINCT ON (grp) *
   FROM   cte
   WHERE  report
   ORDER  BY grp, datetimestamp
   ) a
JOIN  (
   SELECT DISTINCT ON (grp) *
   FROM   cte
   WHERE  report
   AND    message = 'Warning'  -- exclude trailing errors
   ORDER  BY grp, datetimestamp DESC
   ) z USING (grp);

db<>fiddle here

Produces your desired result exactly.

This relies on unique timestamps. If there can be duplicates, work with id additionally (or break ties some other way).

Move the base query from the previous answer to a CTE, and add the tag report to mark qualifying rows - we are going to reference it multiple times.

In the outer query, use two subqueries with DISTINCT ON to retrieve first and last row per group, then join ...

See: