Query to calculate number of active jobs-CodePudding

There is a jobs table,

jobID	state	timestamp
1	inProgress	2023-11-02 10:39:37
1	some random log	2023-11-02 10:40:37
2	inProgress	2023-11-02 10:43:37
1	inProgress	2023-11-02 10:44:37
1	inProgress	2023-11-02 10:45:37
1	inProgress	2023-11-02 10:50:00
1	end	2023-11-02 11:40:37
1	inProgress	2023-11-02 11:43:37
1	inProgress	2023-11-02 11:44:37
1	end	2023-11-02 11:45:37
2	some random log	2023-11-02 11:50:37
2	end	2023-11-02 12:00:37

I would like to

calculate how many jobs are in progress (i.e. has already started but not end yet)

I first thought I could calculate count(state=inProgress)-count(state=end) to get the number of inProgress jobs but inProgress could show up not only once.

get execution duration for each finished job

Note that a job can execute multiple times, exg. jobID1 executed from start to finish twice, therefore two execution durations need to be calculated for jobID1.

jobID1 execution duration is time diff between 2023-11-02 11:40:37, 2023-11-02 10:39:37 AND time diff between 2023-11-02 11:45:37, 2023-11-02 11:43:37

jobID2 execution duration is time diff between 2023-11-02 12:00:37, 2023-11-02 10:43:37

CodePudding user response：

CREATE TABLE jobs(
  jobID INT,
  state VARCHAR(50),
  timestamp datetime);

INSERT INTO jobs VALUES
(1, 'inProgress', '2023-11-02 10:39:37'),
(1, 'some random log', '2023-11-02 10:40:37'),
(2, 'inProgress', '2023-11-02 10:43:37'),
(1, 'inProgress', '2023-11-02 10:44:37'),
(1, 'inProgress', '2023-11-02 10:45:37'),
(1, 'inProgress', '2023-11-02 10:50:00'),
(1, 'end', '2023-11-02 11:40:37'),
(1, 'inProgress', '2023-11-02 11:43:37'),
(1, 'inProgress', '2023-11-02 11:44:37'),
(1, 'end', '2023-11-02 11:45:37'),
(2, 'some random log', '2023-11-02 11:50:37'),
(2, 'end', '2023-11-02 12:00:37');

Based on the data sample above, here is a suggestion:

WITH end_state AS (
  SELECT ROW_NUMBER() OVER (ORDER BY jobID, timestamp) AS Rsnum,
         jobID, timestamp AS end_time
  FROM jobs 
  WHERE state='end'),
  es_wDelimiter AS (
SELECT e1.*, 
       IF(e2.end_time < e1.end_time, e2.end_time,NULL) AS esDelimiter
FROM end_state e1
LEFT JOIN end_state e2 ON e1.jobID=e2.jobID AND e1.end_time <> e2.end_time)

SELECT Rsnum,
       j.jobID,
       MIN(timestamp) AS StartTime,
       MAX(end_time) AS EndTime,
       TIMEDIFF(MAX(end_time),MIN(timestamp)) AS ExecutionTime
FROM jobs j 
 LEFT JOIN es_wDelimiter e
ON j.jobID=e.jobID 
AND j.timestamp > IFNULL(esDelimiter,0)
AND j.timestamp <= e.end_time
GROUP BY Rsnum, j.jobID
ORDER BY RsNum, j.jobID;

The query operation are as following:

Get all timestamp with state=end and assign with row number - the row numbers are used as new group.
Wrap the first query as common table expression and do another query to create a custom delimiter.
wrap the second query as another common table expression and do a LEFT JOIN query to it with jobs table with these conditions:

jobs.timestamp value is larger than the delimiter value that been set earlier (if the delimiter value is NULL then it will be assigned with 0).
jobs.timestamp value is smaller or the same as the end_time value we extracted from the first query.

Here is a fiddle for reference: https://dbfiddle.uk/RBEHLuji

CodePudding user response：

For a job that may have multiple instances the following approach "looks backward" to locate any previous ends for the same jobid, each instance of the jobid is also assigned using row_number(). Then these rows for "ends" are then joined to any rows representing "In Progress" with conditions to ensure they fit within "previous end" (or NULL) and "this end". Then the minimum start of each job instance is calculated, and the duration (in seconds) also made available.

WITH e AS (
    SELECT
          id
        , jobid
        , LAG(TIMESTAMP) OVER (PARTITION BY jobID ORDER BY TIMESTAMP) AS prev_end
        , TIMESTAMP AS end_time
        , row_number() OVER (PARTITION BY jobid ORDER BY TIMESTAMP) AS job_inst
    FROM jobs
    WHERE STATE = 'end'
    )
SELECT
      e.jobid
    , e.job_inst
    , min(p.start_time) start_time
    , e.end_time
    , TIMESTAMPDIFF(SECOND, min(p.start_time), e.end_time) AS du_seconds
FROM e
INNER JOIN (
    SELECT
          id
        , jobID
        , TIMESTAMP AS start_time
    FROM jobs
    WHERE STATE = 'inProgress'
    ) p ON e.jobid = p.jobid 
      AND (
               (p.start_time < e.end_time and e.prev_end IS NULL)
             OR
               (p.start_time > e.prev_end and p.start_time < e.end_time)
             )
GROUP BY
     e.jobid
   , e.job_inst
   , e.end_time
ORDER BY
     e.jobid
   , e.job_inst

jobid	job_inst	start_time	end_time	du_seconds
1	1	2023-11-02 10:39:37	2023-11-02 11:40:37	3660
1	2	2023-11-02 11:43:37	2023-11-02 11:45:37	120
2	1	2023-11-02 10:43:37	2023-11-02 12:00:37	4620

See fiddle

Once you have the jobs and instances of them available, the min/max start/end and total duration can be easily calculated from the result seen above (optionally the max(job_inst) could also be displayed).

SELECT
      jobid
    , MAX(job_inst)
    , MIN(start_time)
    , MAX(end_time)
    , SUM(du_seconds)
FROM (
       { result from above }
     ) d
GROUP BY
      jobid

CodePudding user response：

To count the number of jobs that are in progress (started but not yet finished):

  SELECT COUNT(DISTINCT jobID) AS inProgressJobs
    FROM jobs
    WHERE jobID NOT IN (
        SELECT jobID
        FROM jobs
        WHERE state = 'end'
    );

To calculate the execution duration for each finished job:

SELECT jobID, 
       MIN(CASE WHEN state = 'end' THEN timestamp END) AS end_time,
       MAX(CASE WHEN state = 'inProgress' THEN timestamp END) AS start_time
FROM jobs
GROUP BY jobID, state
HAVING start_time IS NOT NULL AND end_time IS NOT NULL;