MySQL optimizer index choice for simple select query-CodePudding

I have a table called PendingExpense with a few simple columns but a lot of rows. I'm working on some queries for paginated GET responses, but running into some confusion working with the queries and the MySQL optimizer seemingly making a senseless decision to do a full index scan for the ORDER BY clause before filtering from the WHERE clause.

This is on MySQL version 8.0.23.

PendingExpense DDL (note, a companyId and loginCredentialId is how I specify a user in my schema):

create table PendingExpense
(
    ID                        bigint   auto_increment primary key,
    LOGINCREDENTIALID         int      null,
    COMPANYID                 int      null,
    DATE                      datetime null,
    -- ... other rows that don't pertain, e.g. amount, status, type, state, country, merchant
)

create index IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC
    on PendingExpense (COMPANYID, LOGINCREDENTIALID);

create index IN_PendingExpense_LOGINCREDENTIALID_ASC
    on PendingExpense (LOGINCREDENTIALID);

create index IN_PendingExpense_Date
    on PendingExpense (DATE);

Then here are the two queries I'm comparing, they are identical other than the index hint. I'm including the execution plans for both immediately below:

Query 1 (no hints):

explain analyze select id from PendingExpense
where COMPANYID = 1641 and LOGINCREDENTIALID = 2451
order by date DESC, id DESC
limit 101; -- takes 5.5 seconds

-> Limit: 101 row(s)  (cost=2356102.00 rows=101) (actual time=2292.676..4474.843 rows=101 loops=1)
    -> Filter: ((PendingExpense.LOGINCREDENTIALID = 2451) and (PendingExpense.COMPANYID = 1641))  (cost=2356102.00 rows=105) (actual time=2292.675..4474.818 rows=101 loops=1)
        -> Index scan on PendingExpense using IN_PendintExpense_Date (reverse)  (cost=2356102.00 rows=5660) (actual time=0.088..4371.774 rows=1491859 loops=1)

Query 2 (index hint):

explain analyze select id from PendingExpense use index (IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC)
where COMPANYID = 1641 and LOGINCREDENTIALID = 2451
order by date desc, id desc
limit 101; -- .184 seconds

-> Limit: 101 row(s)  (cost=9722.30 rows=101) (actual time=38.255..38.267 rows=101 loops=1)
    -> Sort: PendingExpense.`DATE` DESC, PendingExpense.ID DESC, limit input to 101 row(s) per chunk  (cost=9722.30 rows=27778) (actual time=38.254..38.259 rows=101 loops=1)
        -> Index lookup on PendingExpense using IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC (COMPANYID=1641, LOGINCREDENTIALID=2451)  (actual time=0.046..35.410 rows=14170 loops=1)

Essentially, I'm confused why MySql chooses to do the full index scan first before filtering on companyId / loginCredentialId when the index already exists for those two, causing significant inefficiencies. I'd much prefer to not have to specify index hints in my code/queries for cleanliness. I was under the impression MySQL generally chooses to run the where clause filtering first, especially if an index already exists for it.

Any help / hints / insight would be appreciated here. Thanks!

CodePudding user response：

Your optimized query would be one that includes the where clause FIRST, then secondarily the order by. So I would have an index on

( COMPANYID, LOGINCREDENTIALID, DATE, ID )

Company and credentials covers the where clause. Then the date and ID for the order by clause.

CodePudding user response：

This composite, covering, index should be perfect for that query:

INDEX(COMPANYID, LOGINCREDENTIALID,   -- in either order
      date, id)    -- last, in this order

The first two columns are tested via =, allowing the INDEX rows to be precisely found.

The last two rows can be scanned backwards to perfectly go through the index.

Since all the necessary rows are in the index (hence "covering" aka "Using index"), the data's BTree does not need to be touched.