Home > other >  MariaDB 5.5.68, prevent toxic selects?
MariaDB 5.5.68, prevent toxic selects?

Time:10-26

I know that this MariaDB version 5.5.68 is really out of date, but I have to stay with this old version for a while.

Is there a way to prevent toxic selects, possibly blocking MyISAM tables for a longer time (minutes)? The thing is that the select creates a READ BLOCK on the whole MyISAM table and further inserts wait until they're all gone. So the long running select starts to block the system.

Take this example table:

CREATE TABLE `tbllog` (
  `LOGID` bigint unsigned NOT NULL auto_increment,
  `LOGSOURCE` smallint unsigned default NULL,
  `USERID` int unsigned default NULL,
  `LOGDATE` datetime default NULL,
  `SUBPROVIDERID` int unsigned default NULL,
  `ACTIONID` smallint unsigned default NULL,
  `COMMENT` varchar(255) default NULL,
  PRIMARY KEY (`LOGID`),
  KEY `idx_LogDate` (`LOGDATE`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

The following select works fine until less than 1 Mio entries in the table (the customers set the date range):

SELECT * 
FROM tbllog 
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' 
AND subproviderid=1 
ORDER BY logid 
LIMIT 500;

But it becomes toxic if there are 10 Mio entries or more in the table. Then it starts to run for minutes, consumes a lot of memory and starts blocking the app.

This is the query plan with ~600.000 entries in the table:

 ------ ------------- -------- ------- --------------- --------- --------- ------ ------ ------------- 
| id   | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
 ------ ------------- -------- ------- --------------- --------- --------- ------ ------ ------------- 
|    1 | SIMPLE      | tbllog | index | idx_LogDate   | PRIMARY | 8       | NULL |  624 | Using where |
 ------ ------------- -------- ------- --------------- --------- --------- ------ ------ ------------- 

The thing is, that I need to know if this becomes toxic or not before execution. So maybe I can warn the user that this might block the system for a while or even deny execution.

I know that InnoDB might not have this issue, but I don't know the drawbacks of a switch yet and I think it might be best to stay for the moment.

I tried to do a simple SELECT COUNT(*) FROM tbllog WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 before (removing LIMIT and ORDER BY), but it is not really much faster than the real query and produces double the load in the worst case.

I also considered a worker thread (like mentioned here). But this is a relevant change to the whole system, too. InnoDB would be less impact I think.

Any ideas about this issue?

CodePudding user response:

Try creating a multi-column index specifically for your query.

CREATE INDEX sub_date_logid ON tbllog (subproviderid, logdate, logid);

This index should satisfy the WHERE filters in your query directly. Then it should present the rows in logid order so your ORDER BY ... LIMIT clauses don't have to sort the whole table. Will this help on long-dead MariaDB 5.5 with MyISAM? Hard to say for sure.

If it doesn't solve your performance problem, keep the multicolumn index and try doing the ORDER BY...LIMIT on the logid values rather than all the rows.

SELECT * 
  FROM tbllog 
 WHERE logid IN (
        SELECT logid
          FROM tbllog
         WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' 
           AND subproviderid=1 
         ORDER BY logid 
         LIMIT 500 )
 ORDER BY logid;

This can speed things up because it lets MariaDB sort just the logid values to find the ones it wants. Then the outer query fetches only the 500 rows needed for your result set. Less data to sort = faster.

CodePudding user response:

Your EXPLAIN report shows that it's doing an index-scan on the primary key index. I believe this is because the range of dates is too broad, so the optimizer thinks that it's not much help to use the index instead of simply reading the whole table. By doing an index-scan of the primary key (logid), the optimizer can at least ensure that the rows are read in the order you requested in your ORDER BY clause, so it can skip sorting.

If I test your query (I created the table and filled it with 1M rows of random data), but make it ignore the primary key index, I get this EXPLAIN report:

mysql> explain SELECT *  FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:
 ---- ------------- -------- ------------ ------- --------------- ------------- --------- ------ -------- ---------- ---------------------------------------------------- 
| id | select_type | table  | partitions | type  | possible_keys | key         | key_len | ref  | rows   | filtered | Extra                                              |
 ---- ------------- -------- ------------ ------- --------------- ------------- --------- ------ -------- ---------- ---------------------------------------------------- 
|  1 | SIMPLE      | tbllog | NULL       | range | idx_LogDate   | idx_LogDate | 6       | NULL | 271471 |    10.00 | Using index condition; Using where; Using filesort |
 ---- ------------- -------- ------------ ------- --------------- ------------- --------- ------ -------- ---------- ---------------------------------------------------- 

This makes it use the index on the logdate, so it examine fewer rows, according to the proportion matched by the date range condition. But the resulting rows must be sorted ("Using filesort" in the Extra column) before it can apply the LIMIT.

This won't help at all if your range of dates covers the whole table anyway. In fact, it will be worse, because it will access rows indirectly by the logdate index, and then it will have to sort rows. This solution helps only if the range of dates in the query matches a small portion of the table.

A somewhat better index is a compound index on (subproviderid, logdate).

mysql> alter table tbllog add index (subproviderid, logdate);

mysql> explain SELECT *  FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'  AND subproviderid=1  ORDER BY logid  LIMIT 500;
 ---- ------------- -------- ------------ ------- --------------------------- --------------- --------- ------ ------- ---------- --------------------------------------- 
| id | select_type | table  | partitions | type  | possible_keys             | key           | key_len | ref  | rows  | filtered | Extra                                 |
 ---- ------------- -------- ------------ ------- --------------------------- --------------- --------- ------ ------- ---------- --------------------------------------- 
|  1 | SIMPLE      | tbllog | NULL       | range | idx_LogDate,SUBPROVIDERID | SUBPROVIDERID | 11      | NULL | 12767 |   100.00 | Using index condition; Using filesort |
 ---- ------------- -------- ------------ ------- --------------------------- --------------- --------- ------ ------- ---------- --------------------------------------- 

In my test, this helps the estimate of rows examined drop from 271471 to 12767 because they're restricted by subproviderid, then by logdate. How effective this is depends on how frequently subproviderid=1 is matched. If that's matched by virtually all of the rows anyway, then it won't be any help. If there are many different values of subproviderid and they each have a small fraction of rows, then it will help more to add this to the index.

In my test, I made an assumption that there are 20 different values of subproviderid with equal frequency. That is, my random data inserted round(rand()*20) as the value of subproviderid on each row. Thus it is expected that adding subproviderid resulted in 1/20th of the examined rows in my test.

To choose the order of columns listed in the index, columns referenced in equality conditions must be listed before the column referenced in range conditions.

There's no way to get a prediction of the runtime of a query. That's not something the optimizer can predict. You should block users from requesting a range of dates that will match too great a portion of the table.

  • Related