Home > Blockchain >  How can I optimize this range-based query?
How can I optimize this range-based query?

Time:03-12

I have a MySQL database with four columns: library_name, a string, function_name, a string, low_num, a number, and high_num, a number. low_num and high_num together represent a range, and all the ranges belonging to the same library (having the same library_name) are mutually exclusive. I need to run support a query where, given a ~14,000 str-num pairs, for each pair it returns the row where low_num <= num < high_num and library_name = str. I've made my primary key (library_name, low_num), and my current fastest query is (select * from table where library_name = $name and $num between low_num and high_num limit 1) union (select ...) .... I.e., each pair gets its own query, and then they all get unioned together. However, it's still quite slow (takes around 20 seconds). I also run into memory issues when I do 1 query like this for the 14,000 pairs, so I'm having to break it up into ~14 queries to find 1,000 pairs each (but even one of these 1,000 pair queries on its own takes like 4 seconds). Any ideas how to speed up this query?

SHOW CREATE TABLE: https://db-fiddle.com/f/bjB1zLez2suhdCzt6itge5/0

EXPLAIN SELECT (for just one of the selects in the union): https://db-fiddle.com/f/hyq7aKN89soLZDPyxyJu5T/0

CodePudding user response:

Okay here's the EXPLAIN:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
#3 UNION sym_large NULL range PRIMARY PRIMARY 1026 NULL 14000 11.11 Using where

The PRIMARY key of the table is on (library_name, low_nm) which are VARCHAR(255) and INT respectively. The key_len of 1026 indicates it is using both columns of the PRIMARY key: 4 bytes for the INT and 255*4 for the utf8mb4 string plus 2 bytes for the string length.

Probably filtering on both low_num and high_num is not possible, because both are inequality conditions. Usually, an index only helps to reduce the examined rows for N columns in equality conditions, followed by at most one column in an inequality condition. Your conditions are like one equality, followed by two inequality:

where library_name = $name and low_num <= $num and high_num >= $num
               (equality)          (inequality)         (inequality)

So only the first two terms of the search can be optimized by the index. The third search term must be applied "the hard way," by testing each row matched by the first two terms.

There's a possibility that applying the search to high_num instead of low_num would be better for reducing the number of examined rows, but that depends on the data. That is, if on average there are a smaller number of rows between $num and high_num than between low_num and $num, it could be more efficient.

If that's true, then using a different index on (library_name, high_num) instead may help. But it still would be limited to searching on two out of the three conditions. There's no way to filter on both inequalities in the same query.

You could also avoid the UNION if you load your 14,000 name/number pairs into a temporary table and do a JOIN instead of thousands of subqueries.

CREATE TABLE tmp_search_pairs (
  library_name VARCHAR(255) NOT NULL,
  num INT UNSIGNED NOT NULL,
  KEY (library_name, num)
);

...insert your search data into the temp table...

SELECT s.*
FROM sym_large AS s
JOIN tmp_search_pairs AS p
  ON s.library_name = p.library_name 
  AND s.low_num <= p.num 
  AND s.high_num >= p.num;

I think it's personal preference to use that inequality syntax vs. the BETWEEN syntax. They should optimize the same.

CodePudding user response:

SELECT *
    FROM ( SELECT * FROM sym_large
            WHERE library_name = ?
              AND low_num >= ?
            ORDER BY low_num
            LIMIT 1
         ) AS x
    WHERE high_num <= ?;

The inner query is very fast because of the PK, and finds one row. The outer part verifies that high_num has not been violated. If you are sure there will always be a match, then simply do the inner query.

  • Related