Home > OS >  ORDER BY COUNT is slow
ORDER BY COUNT is slow

Time:02-01

A table (phototag) consists of two foreign key columns (photoid, tagid). I want to get the most related photos based on their tags.

There are 4 million photos with 5-10 tags. For example, the photoid 10009 has 6 tags. I need photos that have similar tags.

SELECT photoid
FROM phototag
WHERE photoid != 10009
  AND tagid IN (21192, 3501, 35286, 21269, 16369, 48136)
GROUP BY photoid
ORDER BY COUNT(photoid) DESC
LIMIT 24;

Without ORDER BY COUNT the query is super fast.

I tried but no result:

  • Optimizing table
  • making a primary key based on two columns
  • indexing columns separately
  • switching InnoDB to MyISAM

UPDATE

Here is the result with explain prefix

enter image description here

CodePudding user response:

Without ORDER BY clause, if the database finds 24 records that satisfy the condition, it may return the results before reading all table records. With the ORDER BY clause, the database must read all records in order to sort them according to the count result.

The IN operator makes queries difficult to optimize. Database needs to repeatedly access the index as many times as the number of variables. Plan generators are easy to induce to do table scans in these situations.

I don't think creating index is helpful. However, phototag(tagid, photoid) is better than phototag(photoid, tagid) because the database cannot use the index when the condition is negative.

CodePudding user response:

Count queries are notoriously difficult to optimize. However, we should be able to index the WHERE clause. Consider trying the following compound index:

CREATE INDEX idx ON phototag (photoid, tagid);

This index covers the WHERE clause, as well as the SELECT, and should be usable while aggregating with GROUP BY.

Another totally different approach would be to create a summary table. This table would take the following structure:

photoid | count
10000   | 25
10001   | 300
...     | ...

Every time you insert a new record into phototag which is not one of the blacklisted tagid values, you would also update the summary table. For example, assuming you had just inserted a whitelisted tag for photoid 10001, you would execute the following update:

UPDATE summary
SET count = count   1
WHERE photoid = 10000;

Now, with this summary table in hand, your original query just becomes:

SELECT photoid
FROM summary
WHERE photoid != 10009
ORDER BY count DESC
LIMIT 24;

Note that there is no aggregation needed. Instead, the aggregation is happening all along the way in the summary table.

CodePudding user response:

If you change what you mean by "similar", you can reduce the amount of work the database engine needs to do. (Mostly, avoiding sorting the "score" for all or most of you 4 million photos.

For example, you could pick all photos that share your target photo's one "rarest" tag.

SELECT *
  FROM phototag
 WHERE photoid != 10009
   AND tagid = (
         SELECT tagid
           FROM phototag
          WHERE tagid IN (
                  SELECT tagid
                    FROM phototag
                   WHERE photoid = 10009
                )
       GROUP BY tagid
       ORDER BY COUNT(*), tagid
          LIMIT 1
       )

With indexes on both (tagid, photoid) and (photoid, tagid).

This massively reduces the number of rows that need to be processed and makes that processing easier.

But...

It is not what you originally asked for, and it's possible that it will return zero matching photos.

What it is...

An example of a different definition of "similar" that can reduce the amount of data being processed, leading to faster queries.

So, there's your trade off...

Either:

  • you accept the slow performance of a query that reads and re-sorts pretty much the whole table
  • you redefine "similar" to something "easier to solve" but probably "less good"
  • Related