Query becomes slow when two conditions are both applied

Time:10-16

I have a users table with a bio field and an n:n relationship with itself via a followers table, so each user U can follow many other users. My user search query is extremely slow.

  1. All queries get the first 20 search results (limit 20).
  2. Searching for users who have "founder" in their bio takes 0.3s.
  3. Searching for users who follow X takes 0.03s.
  4. Searching for users who have "founder" in their bio AND follow X takes 118s.

Query for both filters:

select distinct `twitter_user`.`id`
from `twitter_user`
         join `twitter_user_follower`
              on (
                          `twitter_user_follower`.`follower_twitter_user_id` =
                          `twitter_user`.`id`
                      and `twitter_user_follower`.`twitter_user_id` = 4899565692
                      and `twitter_user_follower`.`follower_download_id` = 7064
                  )
where MATCH(twitter_user.description) AGAINST('founder')
limit 20 offset 0

Table definitions:

CREATE TABLE `twitter_user` (
  `id` bigint NOT NULL,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) DEFAULT NULL,
  `screen_name` varchar(128) DEFAULT NULL,
  `location` varchar(256) DEFAULT NULL,
  `description` varchar(512) DEFAULT NULL,
  `url` varchar(256) DEFAULT NULL,
  `is_protected` bit(1) DEFAULT NULL,
  `followers_count` int DEFAULT NULL,
  `is_verified` bit(1) DEFAULT NULL,
  `friends_count` int DEFAULT NULL,
  `created_at` bigint DEFAULT NULL,
  `favourites_count` int DEFAULT NULL,
  `utc_offset` int DEFAULT NULL,
  `time_zone` varchar(128) DEFAULT NULL,
  `statuses_count` int DEFAULT NULL,
  `profile_image_url` varchar(512) DEFAULT NULL,
  `internal_json` json DEFAULT NULL,
  `row_timestamp` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `twitter_user_username_index` (`screen_name`),
  KEY `twitter_user_ts` (`row_timestamp`),
  FULLTEXT KEY `twitter_user_description_ft_index` (`description`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `twitter_user_follower` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `twitter_user_id` bigint NOT NULL,
  `follower_twitter_user_id` bigint NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `follower_download_id` bigint DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `twitter_user_follower_twitter_user_id_index` (`twitter_user_id`),
  KEY `twitter_user_follower_follower_download_id_index` (`follower_download_id`),
  KEY `tuf_twitter_user_follower_download_key` (`twitter_user_id`,`follower_download_id`,`follower_twitter_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=68494675 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Explain output:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE twitter_user NULL fulltext PRIMARY,twitter_user_username_index,twitter_user_ts,twitter_user_description_ft_index twitter_user_description_ft_index 0 const 1 100.00 Using where; Ft_hints: no_ranking; Using temporary
1 SIMPLE twitter_user_follower NULL ref twitter_user_follower_twitter_user_id_index,twitter_user_follower_follower_download_id_index,tuf_twitter_user_follower_download_key tuf_twitter_user_follower_download_key 25 const,const,si_data_db.twitter_user.id 1 100.00 Using index; Distinct

Tree output:

-> Limit: 20 row(s)  (cost=4.77..4.77 rows=1)
-> Table scan on <temporary>  (cost=2.51..2.51 rows=1)
    -> Temporary table with deduplication  (cost=4.77..4.77 rows=1)
        -> Limit table size: 20 unique row(s)
            -> Nested loop inner join  (cost=2.16 rows=1)
                -> Filter: (match twitter_user.`description` against ('founder'))  (cost=1.06 rows=1)
                    -> Full-text index search on twitter_user using twitter_user_description_ft_index (description='founder')  (cost=1.06 rows=1)
                -> Limit: 1 row(s)  (cost=1.10 rows=1)
                    -> Covering index lookup on twitter_user_follower using tuf_twitter_user_follower_download_key (twitter_user_id=4899565692, follower_download_id=7064, follower_twitter_user_id=twitter_user.id)  (cost=1.10 rows=1)

This query is still slow:

SELECT `follower`.`follower_twitter_user_id`
FROM (
         SELECT `follower_twitter_user_id`
         FROM `twitter_user_follower`
         WHERE `twitter_user_id` = 4899565692
           AND `follower_download_id` = 7440
     ) AS follower
         JOIN `twitter_user` ON `follower`.`follower_twitter_user_id` =  `twitter_user`.`id`
WHERE MATCH(twitter_user.description) AGAINST(' founder' IN BOOLEAN MODE)
limit 20 offset 0;

Explain output:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE twitter_user NULL fulltext PRIMARY,twitter_user_description_ft_index twitter_user_description_ft_index 0 const 1 100.00 Using where; Ft_hints: no_ranking
1 SIMPLE twitter_user_follower NULL ref twitter_user_follower_twitter_user_id_index,twitter_user_follower_follower_download_id_index,tuf_twitter_user_follower_download_key tuf_twitter_user_follower_download_key 25 const,const,si_data_db.twitter_user.id 1 100.00 Using index

Explain Analyze output:

-> Limit: 20 row(s)  (cost=2.16 rows=1) (actual time=3779.933..91032.297 rows=20 loops=1)
    -> Nested loop inner join  (cost=2.16 rows=1) (actual time=3779.932..91032.285 rows=20 loops=1)
        -> Filter: (match twitter_user.`description` against (' founder' in boolean mode))  (cost=1.06 rows=1) (actual time=94.166..90001.280 rows=198818 loops=1)
            -> Full-text index search on twitter_user using twitter_user_description_ft_index (description=' founder')  (cost=1.06 rows=1) (actual time=94.163..89909.371 rows=198818 loops=1)
        -> Covering index lookup on twitter_user_follower using tuf_twitter_user_follower_download_key (twitter_user_id=4899565692, follower_download_id=7440, follower_twitter_user_id=twitter_user.id)  (cost=1.10 rows=1) (actual time=0.005..0.005 rows=0 loops=198818)

The users table is 125GB and the followers table is 5GB on disk. If I rewrite the query as a join of two subselects, it runs in 45s:

select t1.id from
(select follower_twitter_user_id as id from `twitter_user_follower`
 where (
                   `twitter_user_follower`.`twitter_user_id` = 4899565692
               and `twitter_user_follower`.`follower_download_id` = 8039
           )) t1
inner join
(
    select `twitter_user`.`id`
    from `twitter_user` where MATCH(twitter_user.description) AGAINST(' create' IN BOOLEAN MODE)
) t2 on t1.id = t2.id
limit 20 offset 0

Tree output:

 -> Limit: 20 row(s)  (cost=2.18 rows=1)
    -> Nested loop inner join  (cost=2.18 rows=1)
        -> Filter: (match twitter_user.`description` against (' create' in boolean mode))  (cost=1.08 rows=1)
            -> Full-text index search on twitter_user using twitter_user_description_ft_index (description=' create')  (cost=1.08 rows=1)
        -> Covering index lookup on twitter_user_follower using tuf_twitter_user_follower_download_key (twitter_user_id=4899565692, follower_download_id=8039, follower_twitter_user_id=twitter_user.id)  (cost=1.10 rows=1)

Why does it take 45s to run?

CodePudding user response:

Try the following. Change

MATCH(twitter_user.description) AGAINST('founder')

to

MATCH(twitter_user.description) AGAINST(' founder' IN BOOLEAN MODE)

Also, the DISTINCT may not be necessary.
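Putting both suggestions together, the original query might look like this (a sketch using the literal ids from the question; dropping DISTINCT assumes the follower table holds at most one row per (twitter_user_id, follower_download_id, follower_twitter_user_id) triple, which the schema does not enforce):

```sql
-- Boolean-mode fulltext, DISTINCT removed
SELECT `twitter_user`.`id`
FROM `twitter_user`
JOIN `twitter_user_follower`
  ON `twitter_user_follower`.`follower_twitter_user_id` = `twitter_user`.`id`
 AND `twitter_user_follower`.`twitter_user_id` = 4899565692
 AND `twitter_user_follower`.`follower_download_id` = 7064
WHERE MATCH(twitter_user.description) AGAINST(' founder' IN BOOLEAN MODE)
LIMIT 20 OFFSET 0;
```

Without DISTINCT, the optimizer no longer needs the "Temporary table with deduplication" step that showed up in the tree output.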

IODKU?

Buried in a comment I see a DELETE+INSERT pattern that makes for a lot of churn in part of the table.

  • InnoDB's FULLTEXT may not be efficient in such cases
  • If most of the rows aren't changing, then the delete-and-insert approach is inefficient and causes more churn than necessary.

Look at INSERT ... ON DUPLICATE KEY UPDATE ... to see if that can be used instead of the delete insert. If most of the rows don't change, then this will probably be faster and probably have less impact on things such as the Fulltext index.
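A minimal sketch of the IODKU form, assuming the target is twitter_user_follower. Note that IODKU needs a UNIQUE key on the conflict columns, which the posted schema does not currently have; the key name below is hypothetical:

```sql
-- Assumed prerequisite (not in the posted schema):
-- ALTER TABLE twitter_user_follower
--   ADD UNIQUE KEY uk_tuf (twitter_user_id, follower_download_id, follower_twitter_user_id);

INSERT INTO twitter_user_follower
    (twitter_user_id, follower_twitter_user_id, follower_download_id)
VALUES
    (4899565692, 123456789, 7064)   -- 123456789 is a placeholder follower id
ON DUPLICATE KEY UPDATE
    updated_at = CURRENT_TIMESTAMP; -- existing rows are touched, not deleted and re-inserted
```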

If that Delete does actually remove some rows, then the IODKU (upsert) won't suffice. A second pass using something like INSERT ... SELECT ... LEFT JOIN may be the solution for inserting "new" rows. (I allude to such in a different context here: Normalization; see SQL#1.)

Periodically (weekly?), run OPTIMIZE TABLE. But keep some timings to see if this step really helps.
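For example:

```sql
-- For InnoDB this maps to a table rebuild (ALTER TABLE ... FORCE),
-- which also rebuilds the secondary and FULLTEXT indexes.
OPTIMIZE TABLE twitter_user_follower;
```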

2 steps

First, I am still unclear about the data you receive every hour. Is it just info about one user? Does it include rows to delete with some indication that they are to be deleted instead of updated? Etc.

If it is a single user...

  • DELETE only the rows that need deleting. This involves a multi-table Delete with a LEFT JOIN to see what is missing.
  • INSERT ... SELECT ... LEFT JOIN ... to either insert or update the existing rows.
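The two steps above might be sketched as follows, assuming the hourly feed is first loaded into a staging table `incoming_follower` (a hypothetical name) holding the current follower set for the one user:

```sql
-- Step 1: delete followers no longer present in the feed.
DELETE tuf
FROM twitter_user_follower AS tuf
LEFT JOIN incoming_follower AS inc
  ON inc.follower_twitter_user_id = tuf.follower_twitter_user_id
WHERE tuf.twitter_user_id = 4899565692
  AND inc.follower_twitter_user_id IS NULL;

-- Step 2: insert followers that are new in the feed.
INSERT INTO twitter_user_follower
    (twitter_user_id, follower_twitter_user_id, follower_download_id)
SELECT 4899565692, inc.follower_twitter_user_id, inc.follower_download_id
FROM incoming_follower AS inc
LEFT JOIN twitter_user_follower AS tuf
  ON tuf.twitter_user_id = 4899565692
 AND tuf.follower_twitter_user_id = inc.follower_twitter_user_id
WHERE tuf.follower_twitter_user_id IS NULL;
```

Rows that exist in both the feed and the table are left untouched, so the FULLTEXT and secondary indexes see far less churn than with a full delete-and-reload.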

CodePudding user response:

Can you please try this and post the EXPLAIN output for us.

SELECT `follower`.`follower_twitter_user_id`
FROM (
  SELECT `follower_twitter_user_id`
  FROM `twitter_user_follower`
  WHERE `twitter_user_id` = 4899565692
    AND `follower_download_id` = 7064
) AS follower
JOIN `twitter_user` ON `follower`.`follower_twitter_user_id` =  `twitter_user`.`id`
WHERE MATCH(twitter_user.description) AGAINST(' founder' IN BOOLEAN MODE)
limit 20 offset 0;

CodePudding user response:

An option to try, in an attempt to minimise overheads and maximise any short-circuiting:

SELECT
   `twitter_user`.`id`
FROM
  `twitter_user`
WHERE
  MATCH(twitter_user.description) AGAINST (' founder' IN BOOLEAN MODE)
  AND
  EXISTS (
    SELECT
      *
    FROM
      `twitter_user_follower`
    WHERE
          `follower_twitter_user_id` = `twitter_user`.`id`
      AND `twitter_user_id` = 4899565692
      AND `follower_download_id` = 7064
  )
LIMIT 20
OFFSET 0 