Home > OS >  MySQL: giving the query hints about sort order
MySQL: giving the query hints about sort order

Time:10-21

I'm not clear how to search for this, if there's a duplicate feel free to point me to it.

I'm wondering if there's a way to tell mysql that something will be sorted before filtering so that it can perform the filter with a binary search instead of a linear search.

For example, consider a table with columns id, value, and created_at id is an auto-increment and created_at is a timestamp field with default of CURRENT_TIMESTAMP

then consider the following query:

SELECT * 
FROM `table`
WHERE created_at BETWEEN '2022-10-05' AND '2022-10-06'
ORDER BY id 

Because I have context on the data that mysql doesn't, namely that if id is sorted then created_at will also be sorted, I can conclude that we can binary search on created_at. However mysql does a full table scan for the filter as it's unaware of, or unwilling to assume this fact. The explain on the query on my test dataset shows that it's scanning all 50 rows to return the 24 that match the filter, when it's possible to do it by only scanning approximately log2(50) rows. This isn't a huge difference for my test dataset but on larger data it can have an impact.

I'll note that the obvious answer here is to add an index on created_at, but on more real life queries that's not always possible. For example if you were filtering on another indexed column it wouldn't be able to use that created_at index, but we might still be able to make assumptions about the ordering based on other order bys.

Anyway, after all that setup my question is: Is there a way that I can tell MySQL that I know that this data is already sorted so that it need not perform a table scan? Something similar to FORCE INDEX that can be used to overwrite the behaviour of picking an index for a query

CodePudding user response:

The answer is no in general.

InnoDB queries read rows in order by the index used. So in your case if there's an index on created_at, it'll read rows in that order, ascending. There's no way to tell the optimizer that this matches the order of id too, and the optimizer won't make that assumption.

So the bottom line is that it'll have to perform a filesort on the matching rows to guarantee they're in order by id.

The comment above suggests ORDER BY created_at would solve the problem in the example you show. That is, if you know that the order of created_at is the same order as id, then just ORDER BY created_at, and the filesort can be skipped. That is, the optimizer knows the ORDER BY you requested is actually the order it read the rows from the index, so sorting is a no-op.

But I assume your example was only one case among many potential cases, and it might not be the right solution to use in other cases.

But why was the query doing a table-scan?

In the example you give of a table-scan of 50 rows, it's possible the optimizer decided not to use the index because the table-scan is so little work that the extra indirection of using the index isn't worth it. This is why we need to fill a table with at least a few hundred rows during testing, before we know if an index improves the query.

Another possible reason for the table-scan is that the range of dates covers a significantly large part of the table. In my experience, if the condition matches over 20% of the table, then the optimizer says, "meh, not worth the effort to use the index, so I'll just do a table-scan." That 20% figure is not an officially documented threshold, it's just my observation.

FORCE INDEX might help the latter case. FORCE INDEX really just tells the optimizer to assume a table-scan is infinitely costly, so if the named index is relevant at all to the search conditions, then use the index instead of a table-scan. It's possible though to use FORCE INDEX and name an index that is irrelevant to the search conditions. In that case, the optimizer still won't use the index, it'll just feel shame over having to do a table-scan.

CodePudding user response:

Because I have context on the data that mysql doesn't, namely that if id is sorted then created_at will also be sorted...

Give MySQL that information: order by created_at, id and make sure created_at is indexed. This will allow MySQL to use the created_at index to filter and also do most of the ordering. If this is still slow, try adding a composite index of (created_at, id).

Also, upgrade MySQL. 5.6 reached the end of its life last year. MySQL 8 is considerably better at optimizing queries.

  • Related