Home > Software engineering >  Optimizing query that selects on the result of a group by
Optimizing query that selects on the result of a group by

Time:09-17

I have a table that contains pipeline jobs data. A pipeline is composed of many jobs that run independently, and each of them can finish at it's own pace. Once the pipelines are finished, they are archived by setting one of the columns to 1. I want to get the list of jobs of the pipelines whose state is "Done" for all their jobs.

Let's say that my table looks like (sample data shown):

mysql> select id, pipeline, archived, state from jobs where archived=0 limit 4;
 --------- ----------- ---------- ------- 
| id      | pipeline  | archived | state |
 --------- ----------- ---------- ------- 
| 8572387 | pipeline1 |        0 | Done  |
| 8572388 | pipeline1 |        0 | Done  |
| 8572389 | pipeline2 |        0 | Done  |
| 8572390 | pipeline2 |        0 | Fail  |
 --------- ----------- ---------- ------- 
4 rows in set (0.00 sec)

I managed to get the list of failed pipelines:

mysql> select distinct(pipeline) from jobs where archived=0 group by pipeline, state having state!='Done';
 ----------- 
| pipeline  |
 ----------- 
| pipeline2 |
 ----------- 
1 row in set (0.01 sec)

And I even managed to get the answer I'm looking for (real data shown):

select j1.id 
from jobs j1 
where j1.archived=0 
and j1.pipeline not in ( select distinct(j2.pipeline) 
                         from jobs j2 
                         where j2.archived=0 
                         group by j2.pipeline, j2.state having j2.state!='Done'
                        );

 --------- 
| id      |
 --------- 
| 8583200 |
| 8583201 |
| 8583202 |
| 8583203 |
.
.
.
| 8584305 |
| 8584306 |
 --------- 
1107 rows in set (18.77 sec)

My issue is that the first query runs in 0.01s for the real data, but as soon as I add the second select, the time goes up dramatically. This last query took 19s having a total of 2 failed pipelines out of a total of 4, each one having around 500 jobs. When I'm doing this with a full dataset with hundreds of pipelines... it takes too much time.

I'm sure it can be done a lot quicker, in less than 1s. But I cannot manage to get it right :-( Where is my query being stuck?

For reference, the query plan is:

mysql> describe select j1.id from jobs j1 where j1.archived=0 and j1.pipeline not in (select distinct(j2.pipeline) from jobs j2 where j2.archived=0 group by j2.pipeline, j2.state having j2.state!='Done');
 ---- -------------------- ------- ------ --------------- ---------- --------- ------- ------ ---------------------------------------------- 
| id | select_type        | table | type | possible_keys | key      | key_len | ref   | rows | Extra                                        |
 ---- -------------------- ------- ------ --------------- ---------- --------- ------- ------ ---------------------------------------------- 
|  1 | PRIMARY            | j1    | ref  | archived      | archived | 2       | const | 2306 | Using where                                  |
|  2 | DEPENDENT SUBQUERY | j2    | ref  | archived      | archived | 2       | const | 2306 | Using where; Using temporary; Using filesort |
 ---- -------------------- ------- ------ --------------- ---------- --------- ------- ------ ---------------------------------------------- 
2 rows in set (0.00 sec)

CodePudding user response:

You could rewrite it to something like this

A combined INDEX on (pipeline,archived ,state) should speed this up.

The order of the Index column are vital and depend on the granularity of data, so you can play with it, to see which gives better results

SELECT 
    j1.id
FROM
    jobs j1
WHERE
    j1.archived = 0
    AND NOT EXISTS
    (SELECT 1 FROM jobs j2 WHERE j2.pipeline = j1.pipeline
             AND
                j2.archived = 0
            AND j2.state != 'Done')
  • Related