How to optimize datetime comparisons in mysql in where clause-CodePudding

CONTEXT

I have a large table full of "documents" that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I'm having some serious performance issues though.

EXAMPLE CODE

select count(*) from documents;

gets me back 212,494,397 documents in 1 min 15.24 sec.

select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);

which is apx the actual query gets me 55,988,860 in 14 min 36.23 sec.

select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;

notably takes about 15 minutes as well. (this was surprising to me)

THE PROBLEM

How do I perform the updated_at > last_indexed_at in a more reasonable time?

DETAILS

I'm pretty certain that my query is, in some way, not sargable. Unfortunately, I can't find what about this query prevents it from being executed on a row independent basis.

select count(*) 
from documents 
where last_indexed_at is null or updated_at > last_indexed_at;

doesn't do any better.

nor does

select count( distinct( id ) ) 
from documents 
where last_indexed_at is null or updated_at > last_indexed_at limit 1;

nor does

select count( distinct( id ) ) 
from documents limit 1;

EDIT: FOLLOW UP REQUESTED DATA

This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.

/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
  `id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
  `fields` mediumtext COLLATE utf8mb4_unicode_ci,
  `created_at` datetime(6) NOT NULL,
  `updated_at` datetime(6) NOT NULL,
  `last_indexed_at` datetime(6) DEFAULT NULL,
  `deleted_at` datetime(6) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_documents_on_document_type` (`document_type`),
  KEY `index_documents_on_locale` (`locale`),
  KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

SELECT VERSION(); got me 5.7.27-30-log

And probably most import,

explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);

gets me exactly

 ---- ------------- ----------- ------------ ------ --------------- ------ --------- ------ ----------- ---------- ------------- 
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows      | filtered | Extra       |
 ---- ------------- ----------- ------------ ------ --------------- ------ --------- ------ ----------- ---------- ------------- 
|  1 | SIMPLE      | documents | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 208793754 |   100.00 | Using where |
 ---- ------------- ----------- ------------ ------ --------------- ------ --------- ------ ----------- ---------- -------------

CodePudding user response：

Add a covering INDEX

If you had INDEX(last_indexed_at, updated_at), the 15-minute queries might run somewhat faster. (The order of the columns does not matter in this case.)

Assuming both of those columns are columns in the table. If so, then the query must read every row. (I don't know if the term "sargable" covers this situation.)

The INDEX I suggest will be faster because it is "covering". By reading only the index, there is less I/O.

The repeated 15 minutes is probably because innodb_buffer_pool_size was not big enough to hold the entire table. So, it was I/O-bound. My INDEX will be smaller, hence (hopefully) small enough to fit in the buffer_pool. So, it will be faster, and even faster on the second run.

Slow OR

OR is usually a terrible slowdown. But I don't think it matters here.

If you were to initialize last_indexed_at to some old date (say, '2000-01-01'), you could get rid of the COALESCE or OR.

Another way to clean it up is

SELECT  SUM(last_indexed_at IS NULL)  
        SUM(updated_at > last_indexed_at) AS "Need indexing"
    FROM t;

I still need the index. SUM(boolean expression) sees the expression as 0 (false or NULL) or 1 (TRUE).

Meanwhile, I don't think the COUNT(DISTINCT id) is any different than COUNT(*). And the pair of SUMs should also give you the value.

Again, I am depending on "covering" being the trick.

CodePudding user response：

You can, if you're on a recent MySQL version (5.7 ), add a generated column to your table containing your search expression, then index that column.

ALTER TABLE t 
 ADD COLUMN needs_indexing TINYINT 
  GENERATED ALWAYS AS 
     (CASE WHEN last_indexed_at IS NULL THEN 1
           WHEN updated_at > last_indexed_at THEN 1
           ELSE 0 END) VIRTUAL;
ALTER TABLE t 
  ADD INDEX needs_indexing (needs_indexing);

This uses drive space for the index, but not in your table.

Then you can do SELECT SUM(needs_indexing) FROM t to get the number of items matching your criterion.

But: you don't have to count all the items to know you need to reindex some items. Doing a COUNT(*) on a large InnoDB table is very expensive as you have discovered. You can do this:

SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;

You'll get 1 or 0 from this query very quickly. 1 means you have at least one row meeting your criteria.

And, of course, your indexing job can do

SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;

or whatever makes sense. That will also be fast.

CodePudding user response：

Oh! MySQL 5.7 introduced Generated Columns — which gives us a way of indexing expressions!