Home > OS >  Replace "OR" on 2 indexes with a faster solution (UNION?)
Replace "OR" on 2 indexes with a faster solution (UNION?)

Time:09-23

I'm querying shopping-carts in a shop-system, like:

DROP TABLE IF EXISTS c;
CREATE TABLE c (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user` int(10) unsigned DEFAULT NULL,
  `email` VARCHAR(255) NOT NULL DEFAULT '', 
  `number` VARCHAR(20) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `user`(`user`),
  KEY `email`(`email`),
  UNIQUE KEY `number`(`number`)
) ENGINE=InnoDB;

INSERT INTO c SET user=1, email="[email protected]", number="00001";
INSERT INTO c SET user=2, email="[email protected]", number="00002";
INSERT INTO c SET user=3, email="[email protected]", number="00003";
INSERT INTO c SET user=4, email="[email protected]", number="00004";
INSERT INTO c SET user=1, email="[email protected]", number="00005";

I need to query records of c with a column which shows the number of carts which have the same user OR the same email. So I do:

SELECT c.number, 
       (SELECT COUNT(DISTINCT (id)) FROM c AS c2
                  WHERE c2.email = c.email OR c2.user = c.user
       ) AS ordercount
FROM c;
   

 -------- ------------ 
| number | ordercount |
 -------- ------------ 
| 00001  |          3 |
| 00002  |          1 |
| 00003  |          1 |
| 00004  |          3 |
| 00005  |          3 |
 -------- ------------ 

This works, but the problem is that the OR is very slow, because MySQL/MariaDB doesn't use any key in the subquery:

EXPLAIN SELECT c.number, 
               (SELECT COUNT(DISTINCT (id)) FROM c AS c2
                   WHERE c2.email = c.email OR c2.user = c.user
               ) AS ordercount
        FROM c;

 ---- -------------------- ------- ------------ ------ --------------------------- --    ---- --------- ------ ------ ---------- ------------- 
| id | select_type        | table | partitions | type | possible_keys             | key  | key_len | ref  | rows | filtered | Extra       |
 ---- -------------------- ------- ------------ ------ --------------------------- ------ --------- ------ ------ ---------- ------------- 
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL                      | NULL | NULL    | NULL |    5 |   100.00 | NULL        |
|  2 | DEPENDENT SUBQUERY | c2    | NULL       | ALL  | PRIMARY,number,user,email | NULL | NULL    | NULL |    5 |    36.00 | Using where |
 ---- -------------------- ------- ------------ ------ --------------------------- ------ --------- ------ ------ ---------- ------------- 

Even forcing the index doesn't make the DB using it:

EXPLAIN SELECT c.number, 
               (SELECT COUNT(DISTINCT (id)) FROM c AS c2 FORCE INDEX(email, user)
                  WHERE c2.email = c.email OR c2.user = c.user
               ) AS ordercount
        FROM c;

 ---- -------------------- ------- ------------ ------ --------------------------- --    ---- --------- ------ ------ ---------- ------------- 
| id | select_type        | table | partitions | type | possible_keys             | key  | key_len | ref  | rows | filtered | Extra       |
 ---- -------------------- ------- ------------ ------ --------------------------- ------ --------- ------ ------ ---------- ------------- 
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL                      | NULL | NULL    | NULL |    5 |   100.00 | NULL        |
|  2 | DEPENDENT SUBQUERY | c2    | NULL       | ALL  | PRIMARY,number,user,email | NULL | NULL    | NULL |    5 |    36.00 | Using where |
 ---- -------------------- ------- ------------ ------ --------------------------- ------ --------- ------ ------ ---------- ------------- 

Using either column "email" or column "user" works fine, the key is used:

EXPLAIN SELECT c.number, 
               (SELECT COUNT(DISTINCT (id)) FROM c AS c2 WHERE c2.email = c.email) AS ordercount
        FROM c;

 ---- -------------------- ------- ------------ ------ --------------------------- ------- --------- -------------- ------ ---------- ------------- 
| id | select_type        | table | partitions | type | possible_keys             | key   | key_len | ref          | rows | filtered | Extra       |
 ---- -------------------- ------- ------------ ------ --------------------------- ------- --------- -------------- ------ ---------- ------------- 
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL                      | NULL  | NULL    | NULL         |    5 |   100.00 | NULL        |
|  2 | DEPENDENT SUBQUERY | c2    | NULL       | ref  | PRIMARY,number,user,email | email | 767     | test.c.email |    3 |   100.00 | Using index |
 ---- -------------------- ------- ------------ ------ --------------------------- ------- --------- -------------- ------ ---------- ------------- 

The problem is that the query runs on large table with about 500.000 entries, making the query taking about 30 seconds only to query a subset of 50 records. Running the query only with the match for "email" or only with the match for "user" it just takes about 1 second for 50 records.

So I need to optimize the query. I tried to change the OR into an UNION:

SELECT c.number, 
(SELECT COUNT(DISTINCT (id)) FROM 
    ((SELECT u1.id FROM c AS u1 WHERE
     u1.email = c.email
    )
    UNION DISTINCT
    (SELECT u2.id FROM c AS u2 WHERE
    u2.user = c.user
    )) AS u2
) AS ordercount
FROM c;

but I'm getting the error: ERROR 1054 (42S22): Unknown column 'c.email' in 'where clause'

Any idea how to make this query using indexes to be faster?

CodePudding user response:

This is an alternative approach using two left joins:

select c.*,
       count(distinct coalesce(ce.id, cu.id))
from c left join
     c ce
     on c.email = ce.email left join
     c cu
     on c.user = cu.user and not cu.email <=> ce.email
group by c.id;

This can use separate indexes on c(user) and c(email).

Basically, this joins along the two separate dimensions, and then brings them together for the count(distinct). There are some worse case scenarios where there might be lots of matches along both dimensions. However, in many cases this might prove to work quite well because it can use the indexes rather than scanning the entire table for each row.

CodePudding user response:

(I assume "c" means "cart".)

(starting over)

Since number is UNIQUE, it may as well be the PRIMARY KEY. Also get rid of id.

CREATE FUNCTION Ct(_user INT, _email VARCHAR(255))
    RETURNS VARCHAR(20)
RETURN
    SELECT COUNT(DISTINCT number)
        FROM
            ( SELECT number
                FROM c
                WHERE user = _user
            ) UNION ALL
            ( SELECT number
                FROM c
                WHERE email = _email
            );

Then do

SELECT number, Ct(user, email)
    FROM c;

Note that I avoided the double-DISTINCT. And, since the PK is implicitly part of each secondary index, the inner Selects have "covering" indexes.

  • Related