I have a MySQL database with several tables, one of which contains 1,400,000 rows. I need to get the 50,000 rows with the highest value in one field, a field that counts visitors.
SELECT uid, title, number, views
FROM ct
WHERE hidden = 0
  AND deleted = 0
  AND number > 0
  AND status LIKE 'active'
ORDER BY views DESC
LIMIT 0, 50000
It is extremely slow. The table is InnoDB, and title, number and views are each indexed.
How can I speed up the selection?
From EXPLAIN:
id  select_type  table  type  possible_keys  key   key_len  ref   rows     Extra
1   SIMPLE       ct     ALL   number_index   NULL  NULL     NULL  1465440  Using where; Using filesort
After adding an index on status, from EXPLAIN:
id  select_type  table  type   possible_keys         key     key_len  ref   rows    Extra
1   SIMPLE       ct     range  number_index,status   status  302      NULL  732720  Using index condition; Using where; Using filesort
CodePudding user response:
How many rows do you have matching
WHERE hidden = 0
AND deleted = 0
AND number > 0
AND status LIKE 'active'
?
If the answer is more than 70,000 or so, then the short answer is that there is nothing you can do to speed things up. If it is significantly less, you will get some improvement with an index on hidden, deleted, number and status. How much of a speed-up you get depends on the order of these attributes in the index and on the cardinality of each attribute (hint: you want your highest-cardinality columns first).
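As a sketch, such an index could be added like this (the index name and column order are just one possibility; pick the order based on the actual cardinalities in your data):

```sql
-- Hypothetical column order: status and number assumed to have the
-- highest cardinality here; reorder to match your data.
ALTER TABLE ct ADD INDEX idx_filter (status, number, hidden, deleted);
```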
CodePudding user response:
This composite index may speed it up:
INDEX(hidden, deleted, status, number, views)
The real problem is shoveling 50K rows to the client. What the heck will the client do with that many rows?
However, "the 50,000 rows with the highest value in one field" is not what your query finds. Perhaps you first need to find "the highest value" in that field, and then search for all rows with that value?
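If that reading is correct, the lookup could be a two-step sketch like this (same ct table as in the question):

```sql
-- Find the single highest visitor count...
SELECT MAX(views) FROM ct;

-- ...then fetch every row holding that value.
SELECT uid, title, number, views
FROM ct
WHERE views = (SELECT MAX(views) FROM ct);
```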
CodePudding user response:
Showing 50k rows is always going to be costly. Just transferring the result set over the network is going to take a while. So there's a limit to how much you can "optimize" the query if the result set is that large. I'd seriously reconsider a design that required a result of 50k rows.
As for the best index for this query, the usual rule applies: use a compound index, consisting of the columns in equality conditions first, then ONE column used in range conditions. In your case, I would suggest:
alter table ct add index (hidden, deleted, status, number)
The first three may be in any order, since they're all equality conditions; number goes last because it's a range condition. There's no way to optimize the ORDER BY, because the range condition spoils that.
A comment asked about partitioning or other methods of optimizing. I don't think it's likely that partitioning pruning will help.
The other method of optimizing I'd use is archiving. How many of the 1.4 million rows are hidden or deleted? Why not move those rows to another table, or to cold storage, or simply delete them? That would keep the active table smaller and easier to keep in the buffer pool.
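Archiving could be as simple as copying the inactive rows out and then removing them. A sketch, assuming an archive table ct_archive (a hypothetical name) with the same structure as ct:

```sql
-- Hypothetical archive table, structurally identical to ct.
CREATE TABLE ct_archive LIKE ct;

-- Copy hidden or deleted rows out of the active table...
INSERT INTO ct_archive
SELECT * FROM ct
WHERE hidden <> 0 OR deleted <> 0;

-- ...then remove them from the active table.
DELETE FROM ct
WHERE hidden <> 0 OR deleted <> 0;
```

On a large table you would want to batch the DELETE (for example with LIMIT in a loop) to avoid long transactions.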