INDEX not being utilized in MySQL when larger dataset


My query looks like this:

EXPLAIN SELECT 
    r.owner_id, 
    r.owner_address, 
    r.owner_platform,
    r.updated_at 
FROM some_owner_table as r 
WHERE 
    r.updated_at > '2022-09-16 22:16:38.832' 
ORDER BY 
    r.updated_at DESC LIMIT 200;

The result is

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
'1', 'SIMPLE', 'r', NULL, 'range', 'INDEX_by_updated_at', 'INDEX_by_updated_at', '6', NULL, '1', '100.00', 'Using index condition'

However, if we use a different date that (I think) increases the number of matching rows, we get:

'1', 'SIMPLE', 'r', NULL, 'ALL', 'INDEX_by_updated_at', NULL, NULL, NULL, '263', '37.64', 'Using where; Using filesort'

Using filesort seems problematic in terms of performance. It's no longer showing Using index condition.

Is this how indexing really works or can we do something to further optimize our queries for this table?

EDIT: Table has 263 total rows.

EDIT: CREATE TABLE statement:

CREATE TABLE `some_owner_table` (
  `owner_id` bigint(20) NOT NULL,
  `owner_address` bigint(20) NOT NULL,
  `owner_platform` int(11) NOT NULL,
  `updated_at` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3),
  PRIMARY KEY (`owner_id`,`owner_platform`),
  KEY `INDEX_by_updated_at` (`updated_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CodePudding user response:

MySQL's optimizer generally chooses to skip the index if it estimates that your condition would match a large enough portion of the table. In my experience, the threshold is around 20%, but this is not documented behavior, and it may shift as the MySQL Server code changes.

The reason is that it actually takes more work per row to find an entry in the index and then fetch the whole row that entry points to. The optimizer assumes there's a point at which it's more economical to just walk the table row by row in primary key order, keeping the rows that match the condition.

But if the specific value you're searching for occurs in only a small subset of the rows, then it's more economical to select those few index entries and fetch the corresponding rows.

If you think the optimizer has made the wrong choice, you can use the FORCE INDEX hint, which makes the optimizer treat a table-scan as infinitely costly, so if the index is relevant at all, it'll use it.

It might be worthwhile here, for example, to avoid the filesort: if you force the query to read rows through the index on updated_at, the rows come back already in index order, so sorting becomes a no-op and the filesort disappears.
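For example, using the query and index name from the question (a sketch; check EXPLAIN afterwards to confirm the plan actually changed):

-- FORCE INDEX pins the optimizer to INDEX_by_updated_at, so rows are
-- read in index order and the ORDER BY needs no separate sort.
SELECT r.owner_id, r.owner_address, r.owner_platform, r.updated_at
FROM some_owner_table AS r FORCE INDEX (INDEX_by_updated_at)
WHERE r.updated_at > '2022-09-16 22:16:38.832'
ORDER BY r.updated_at DESC
LIMIT 200;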

Another idea: If you're testing a very small dataset, the optimizer could reason that using an index doesn't matter, because the number of rows is going to fit in such a small number of pages in RAM anyway, and the cost of searching or sorting will be trivial.

This is why you should test optimization with a larger sample size, because the optimizer might make different choices for a trivially-sized dataset.
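For this particular table, one rough way to grow a test copy is to insert the table into itself with shifted keys. A sketch, assuming owner_id values are positive (the @offset variable is just an illustration; adapt it to your real key ranges):

-- Roughly doubles the row count per run. The offset must exceed
-- MAX(owner_id) so the copies don't collide with the
-- (owner_id, owner_platform) primary key.
SET @offset := (SELECT MAX(owner_id) FROM some_owner_table);
INSERT INTO some_owner_table (owner_id, owner_address, owner_platform, updated_at)
SELECT owner_id + @offset, owner_address, owner_platform,
       updated_at - INTERVAL FLOOR(RAND() * 30) DAY
FROM some_owner_table;

Spreading updated_at over a range of dates also keeps the index statistics closer to realistic.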

CodePudding user response:

The query planner in MySQL / MariaDB makes cost-based choices about whether to use indexes. Oversimplifying, it guesses how much CPU and IO time it will take to satisfy the query with or without using indexes, and chooses the smallest guess. Like @BillKarwin said, those choices can be very different for small (hundreds of rows) and large (millions) tables. So don't waste a whole ton of your time EXPLAINing queries on small tables.

You have the WHERE col > constant ORDER BY col DESC LIMIT 200 pattern in your query. When your query doesn't use an index, the server does roughly these things:

  • scans the table row-by-row looking for rows matching the WHERE clause.
  • places those rows into an internal data structure, a sort of lightweight table. That table, when mentioned in EXPLAIN output, is called a "file". It's not a file-system file.
  • when the table scan is done, it sorts that "file" as specified by ORDER BY.
  • finally, it sends you the first 200 rows from the sorted "file" and discards the rest.

But you have an index on the column in question. When it uses the index, the server can work more efficiently. It

  • random-accesses the index to find the first row matching the WHERE clause.
  • scans sequentially through the index until the last matching row.
  • retrieves and sends you each matching row.
  • and stops when it has sent 200 rows.

No sorting needed. (Hooray.)
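If you're on MySQL 8.0.18 or newer you can watch this happen: EXPLAIN ANALYZE actually runs the query and prints real row counts and timings for each plan step, filesort included (on MariaDB the rough equivalent is ANALYZE SELECT ...):

-- Executes the query and reports per-step timings; a filesort step
-- shows up explicitly in the tree output.
EXPLAIN ANALYZE
SELECT r.owner_id, r.owner_address, r.owner_platform, r.updated_at
FROM some_owner_table AS r
WHERE r.updated_at > '2022-09-16 22:16:38.832'
ORDER BY r.updated_at DESC
LIMIT 200;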

If the query needs to be faster than it currently is in production, you can try recreating your index in descending order like this:

ALTER TABLE some_owner_table
  DROP KEY INDEX_by_updated_at,
  ADD KEY INDEX_by_updated_at (updated_at DESC);

This puts your index in descending order, matching your ORDER BY clause exactly. It is slightly cheaper to scan an index forward than backward, and this revised index makes the scan a forward one. Note that descending indexes are only honored in MySQL 8.0+ (and MariaDB 10.8+); older versions parse the DESC keyword in an index definition but ignore it.

If it's still not fast enough, you can try creating a covering index. A covering index allows the server to satisfy the query completely from the index, rather than going from the index to the table* to look up every row's data. That looks like this.

ALTER TABLE some_owner_table
  DROP KEY INDEX_by_updated_at,
  ADD KEY INDEX_by_updated_at
    (updated_at DESC, owner_id, owner_address, owner_platform);

An index like this takes tablespace (SSD or disk space) and slightly slows down INSERTs and UPDATEs, because the server must update the index as well as the table. But tablespace is cheap. And most tables don't get nearly as many INSERTs or UPDATEs as they do SELECTs.
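To check that the covering index is doing its job, re-run EXPLAIN; the Extra column should include "Using index", which means the rows came from the index alone:

-- "Using index" in Extra = index-only read, no table lookups.
EXPLAIN SELECT r.owner_id, r.owner_address, r.owner_platform, r.updated_at
FROM some_owner_table AS r
WHERE r.updated_at > '2022-09-16 22:16:38.832'
ORDER BY r.updated_at DESC
LIMIT 200;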

* In InnoDB, the table is actually stored inside the primary key. It's called a clustered index.
