CONTEXT
I have a large table full of "documents" that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I'm having some serious performance issues though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is apx the actual query gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably takes about 15 minutes as well. (this was surprising to me)
THE PROBLEM
How do I perform the
updated_at > last_indexed_at
in a more reasonable time?
DETAILS
I'm pretty certain that my query is, in some way, not sargable. Unfortunately, I can't find what about this query prevents it from being executed on a row independent basis.
select count(*)
from documents
where last_indexed_at is null or updated_at > last_indexed_at;
doesn't do any better.
nor does
select count( distinct( id ) )
from documents
where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) )
from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
`id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
`fields` mediumtext COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`updated_at` datetime(6) NOT NULL,
`last_indexed_at` datetime(6) DEFAULT NULL,
`deleted_at` datetime(6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_documents_on_document_type` (`document_type`),
KEY `index_documents_on_locale` (`locale`),
KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION();
got me 5.7.27-30-log
And probably most import,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
---- ------------- ----------- ------------ ------ --------------- ------ --------- ------ ----------- ---------- -------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ----------- ------------ ------ --------------- ------ --------- ------ ----------- ---------- -------------
| 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where |
---- ------------- ----------- ------------ ------ --------------- ------ --------- ------ ----------- ---------- -------------
CodePudding user response:
Add a covering INDEX
If you had INDEX(last_indexed_at, updated_at)
, the 15-minute queries might run somewhat faster. (The order of the columns does not matter in this case.)
Assuming both of those columns are columns in the table. If so, then the query must read every row. (I don't know if the term "sargable" covers this situation.)
The INDEX
I suggest will be faster because it is "covering". By reading only the index, there is less I/O.
The repeated 15 minutes is probably because innodb_buffer_pool_size
was not big enough to hold the entire table. So, it was I/O-bound. My INDEX
will be smaller, hence (hopefully) small enough to fit in the buffer_pool. So, it will be faster, and even faster on the second run.
Slow OR
OR
is usually a terrible slowdown. But I don't think it matters here.
If you were to initialize last_indexed_at
to some old date (say, '2000-01-01'), you could get rid of the COALESCE
or OR
.
Another way to clean it up is
SELECT SUM(last_indexed_at IS NULL)
SUM(updated_at > last_indexed_at) AS "Need indexing"
FROM t;
I still need the index. SUM(boolean expression)
sees the expression as 0 (false or NULL) or 1 (TRUE).
Meanwhile, I don't think the COUNT(DISTINCT id)
is any different than COUNT(*)
. And the pair of SUMs
should also give you the value.
Again, I am depending on "covering" being the trick.
CodePudding user response:
You can, if you're on a recent MySQL version (5.7 ), add a generated column to your table containing your search expression, then index that column.
ALTER TABLE t
ADD COLUMN needs_indexing TINYINT
GENERATED ALWAYS AS
(CASE WHEN last_indexed_at IS NULL THEN 1
WHEN updated_at > last_indexed_at THEN 1
ELSE 0 END) VIRTUAL;
ALTER TABLE t
ADD INDEX needs_indexing (needs_indexing);
This uses drive space for the index, but not in your table.
Then you can do SELECT SUM(needs_indexing) FROM t
to get the number of items matching your criterion.
But: you don't have to count all the items to know you need to reindex some items. Doing a COUNT(*)
on a large InnoDB table is very expensive as you have discovered. You can do this:
SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;
You'll get 1 or 0 from this query very quickly. 1 means you have at least one row meeting your criteria.
And, of course, your indexing job can do
SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;
or whatever makes sense. That will also be fast.
CodePudding user response:
Oh! MySQL 5.7 introduced Generated Columns — which gives us a way of indexing expressions!