Home > Blockchain >  Extra criterion in search_conditions on JOIN "ON" makes query significantly slower, despit
Extra criterion in search_conditions on JOIN "ON" makes query significantly slower, despit

Time:08-19

I have a table process_execution_data ped and I would like to self-join it.

Basically, this table holds the key, value pairs of existing variables during a execution piid of a process process_id.

I've written two very similar queries to do it, their only difference is that one has an extra search condition in the ON for each INNER JOIN.

Query1 - with x.process_id = y.process_id in each ON/JOIN:

SELECT
  queue.value
FROM
  process_execution_data queue
INNER JOIN process_execution_data status_code
ON status_code.process_id = queue.process_id
AND status_code.piid = queue.piid
AND status_code.key_name = 'status_code'
INNER JOIN process_execution_data ref_num
ON ref_num.process_id = queue.process_id
AND ref_num.piid = queue.piid
AND ref_num.key_name = 'ref_num'
WHERE
  queue.process_id = 132
  AND queue.key_name = 'queue'
  AND queue.value = "type_a"
  AND queue.created BETWEEN FROM_UNIXTIME(x) AND FROM_UNIXTIME(y)

Query2

SELECT
  queue.value
FROM
  process_execution_data queue
INNER JOIN process_execution_data status_code
ON status_code.piid = queue.piid
AND status_code.key_name = 'status_code'
INNER JOIN process_execution_data ref_num
ON ref_num.piid = queue.piid
AND ref_num.key_name = 'ref_num'
WHERE
  queue.process_id = 132
  AND queue.key_name = 'queue'
  AND queue.value = "type_a"
  AND queue.created BETWEEN FROM_UNIXTIME(x) AND FROM_UNIXTIME(y)

Query1 runs in 80s~, meanwhile Query2 runs in 1.3s.

What gives such huge difference? Both are using indexes properly (albeit different indexes). But if anything, shouldn't Query1 run faster, since it filters more rows and as well as "lower cost" (as pointed out by the EXPLAIN)?


Additional ped table info

SHOW CREATE TABLE `process_execution_data`
>>>
 CREATE TABLE `process_execution_data` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `key_name` varchar(512) NOT NULL,
   `value` longtext,
   `piid` varchar(512) NOT NULL,
   `created` datetime DEFAULT NULL,
   `modified` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
   `process_id` int(11) DEFAULT NULL,
   `flags` varchar(512) DEFAULT NULL,
   PRIMARY KEY (`id`),
   KEY `all` (`process_id`,`key_name`,`modified`,`created`),
   KEY `key_piid_modified_created` (`key_name`,`piid`,`modified`,`created`),
   KEY `processid_key` (`process_id`,`key_name`),
   KEY `processid` (`process_id`),
   KEY `key` (`key_name`),
   KEY `piid` (`piid`),
   KEY `created` (`created`),
   KEY `modified` (`modified`)
 ) ENGINE=InnoDB AUTO_INCREMENT=31134333 DEFAULT CHARSET=latin1
Query1 EXPLAIN FORMAT=JSON SELECT
>>>
 '{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "722877.86"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "status_code",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid"
          ],
          "key": "all",
          "used_key_parts": [
            "process_id",
            "key_name"
          ],
          "key_length": "519",
          "ref": [
            "const",
            "const"
          ],
          "rows_examined_per_scan": 265312,
          "rows_produced_per_join": 265312,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "265312.00",
            "eval_cost": "53062.40",
            "prefix_cost": "318374.40",
            "data_read_per_join": "528M"
          },
          "used_columns": [
            "key_name",
            "piid",
            "process_id"
          ]
        }
      },
      {
        "table": {
          "table_name": "queue",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid",
            "created"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.status_code.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 13265,
          "filtered": "4.13",
          "index_condition": "(`brain`.`queue`.`created` between <cache>(from_unixtime(1659841200)) and <cache>(from_unixtime(1660445999)))",
          "cost_info": {
            "read_cost": "321034.49",
            "eval_cost": "2653.12",
            "prefix_cost": "703615.79",
            "data_read_per_join": "26M"
          },
          "used_columns": [
            "key_name",
            "value",
            "piid",
            "created",
            "process_id"
          ],
          "attached_condition": "((`brain`.`queue`.`process_id` = `brain`.`status_code`.`process_id`) and (`brain`.`queue`.`value` = ''type_a''))"
        }
      },
      {
        "table": {
          "table_name": "ref_num",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.status_code.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 5661,
          "filtered": "35.27",
          "cost_info": {
            "read_cost": "16051.73",
            "eval_cost": "1132.38",
            "prefix_cost": "722877.86",
            "data_read_per_join": "11M"
          },
          "used_columns": [
            "key_name",
            "piid",
            "process_id"
          ],
          "attached_condition": "(`brain`.`ref_num`.`process_id` = `brain`.`status_code`.`process_id`)"
        }
      }
    ]
  }
}'
Query2 EXPLAIN FORMAT=JSON SELECT
>>>
'{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "974280.63"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "ref_num",
          "access_type": "ref",
          "possible_keys": [
            "key_piid_modified_created",
            "key",
            "piid"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name"
          ],
          "key_length": "514",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 529372,
          "rows_produced_per_join": 529372,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "66172.38",
            "eval_cost": "105874.40",
            "prefix_cost": "172046.78",
            "data_read_per_join": "1G"
          },
          "used_columns": [
            "key_name",
            "piid"
          ]
        }
      },
      {
        "table": {
          "table_name": "queue",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid",
            "created"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.ref_num.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 26468,
          "filtered": "4.13",
          "index_condition": "(`brain`.`queue`.`created` between <cache>(from_unixtime(1659841200)) and <cache>(from_unixtime(1660445999)))",
          "cost_info": {
            "read_cost": "640554.02",
            "eval_cost": "5293.72",
            "prefix_cost": "940711.60",
            "data_read_per_join": "52M"
          },
          "used_columns": [
            "key_name",
            "value",
            "piid",
            "created",
            "process_id"
          ],
          "attached_condition": "((`brain`.`queue`.`value` = ''type_a'') and (`brain`.`queue`.`process_id` = 132))"
        }
      },
      {
        "table": {
          "table_name": "status_code",
          "access_type": "ref",
          "possible_keys": [
            "key_piid_modified_created",
            "key",
            "piid"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.ref_num.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 32027,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "27163.49",
            "eval_cost": "6405.54",
            "prefix_cost": "974280.63",
            "data_read_per_join": "63M"
          },
          "used_columns": [
            "key_name",
            "piid"
          ]
        }
      }
    ]
  }
}'

CodePudding user response:

To really have better optimization of your query, index should be more complete. You have partial vs all fields -- in order for your WHERE clause. I would suggest an index

ON ( Process_id, Key_Name, Value, Created)

This will optimize your primary FROM table regardless of the joins.

Your JOINs in the first query are using 3 columns, where your second query only 2 columns. One would probably think the second query processes slower as its possible to return multiple entries spanning multiple process IDs but having join on the PIID.

For optimization on your secondary JOIN tables, have an index on those criteria via ( Process_id, piid, key_name ).

And your second query, an index on ( key_name, piid ), so within each respective JOIN, all the same key_name index components are grouped (status_code vs ref_num), THEN the piids subsorted. I suspect this might give better performance.

I would also suggest removing redundant indexes. For example, you have indexes on

   KEY `all` (`process_id`,`key_name`,`modified`,`created`),
   KEY `key_piid_modified_created` (`key_name`,`piid`,`modified`,`created`),
   KEY `processid_key` (`process_id`,`key_name`),
   KEY `processid` (`process_id`),
   KEY `key` (`key_name`),
   KEY `piid` (`piid`),
   KEY `created` (`created`),
   KEY `modified` (`modified`)

In the above case, the index processid is redundant from the ALL and PROCESSID_KEY index.

ProcessID_Key is also redundant of the ALL index.

Having the stand-alone column indexes is not the greatest plan unless you are explicitly querying on ONLY that column and no other critical elements of importance. They just take up space and index pages of data.

With respect to order of columns in the index, it IS important. You are querying based on when records are CREATED, yet your ALL index has modified, THEN created column. If you think about it, if you are ever trying to filter based on a MODIFIED, it would have to be CREATED FIRST, and modified only on/after when created. So would it not make sense to order by the CREATED DATE first, then take into consideration modified.

Anyhow, just some observations and hopefully clarification on better indexing options for you.

CodePudding user response:

"Cost" is a good metric. But is it not perfect. It probably has no histogram of index distribution, especially with composite indexes. Nor does it take into account I/O needed, especially for the current state of the cache (buffer_pool).

These should be optimal for both queries:

queue:  INDEX(process_id, key_name, created,  piid)
status_code:  INDEX(piid, key_name, process_id)
ref_num:      INDEX(piid, key_name, process_id)

When adding a composite index, DROP index(es) with the same leading columns. That is, when you have both INDEX(a) and INDEX(a,b), toss the former.

EXPLAIN says that it stopped after 2 used_key_parts.

If value were not LONGTEXT, it could be included in the index. This is one of several serious problems with the EAV schema pattern.

Index Cookbook

  • Related