Individual Subqueries are fast. Adding 1 Subquery is fast. However, adding more than 1 Subquery toge-CodePudding

I have the following query to select users who have posted at least once each week for the past 4 weeks:

SELECT DISTINCT(user_id) FROM `posts` WHERE

    user_id IN (SELECT DISTINCT(user_id) FROM `posts` WHERE post_date > ((UNIX_TIMESTAMP()- (604800*0) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*0))) AND
    user_id IN (SELECT DISTINCT(user_id) FROM `posts` WHERE post_date > ((UNIX_TIMESTAMP()- (604800*1) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*1))) AND
    user_id IN (SELECT DISTINCT(user_id) FROM `posts` WHERE post_date > ((UNIX_TIMESTAMP()- (604800*2) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*2))) AND
    user_id IN (SELECT DISTINCT(user_id) FROM `posts` WHERE post_date > ((UNIX_TIMESTAMP()- (604800*3) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*3)))

If I run the query with just 1 of any of those 4 sub-queries, it is fast.

If I run any of those 4 sub-queries individually, they are fast.

However, the moment I add more than 1 of those subqueries together, it crashes MySQL.

How can I fix this so it works considering the individual parts are fast?

(I should add that in this particular case, I cannot add any more indexes, so the solution should not involve indexing).

CodePudding user response：

Try to EXPLAIN <query> to see the execution plan, each subquery could read many rows of data.

But I suggest to re-write the query for this particular purpose, it needs to GROUP BY user and COUNT DISTINCT weeks as 4, read the table once is sufficient.

SELECT user_id
FROM `posts` 
WHERE post_date > ((UNIX_TIMESTAMP()- (604800*3) ) - 604800)
GROUP BY user_id
HAVING COUNT(DISTINCT 
    (CASE WHEN post_date > ((UNIX_TIMESTAMP()- (604800*0) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*0)) THEN 1
        WHEN post_date > ((UNIX_TIMESTAMP()- (604800*1) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*1)) THEN 2
        WHEN post_date > ((UNIX_TIMESTAMP()- (604800*2) ) - 604800) AND post_date <= (UNIX_TIMESTAMP()- (604800*2)) THEN 3
        ELSE 4 END)) = 4

About the query

WHERE post_date > ((UNIX_TIMESTAMP()- (604800*3) ) - 604800) filter applicable data only
CASE WHEN post_date > ((UNIX_TIMESTAMP ... transform the post_date into week 1, 2, 3 or 4

Now if there are data like below

 --------- ------ 
| user_id | week |
 --------- ------ 
|       1 |    1 |
|       1 |    2 |
|       1 |    3 |
|       1 |    3 |
|       2 |    1 |
|       2 |    2 |
|       2 |    3 |
|       2 |    4 |
 --------- ------

There are four rows for each user_id, but user 1 appears in week 1,2,3, COUNT DISTINCT week is 3, user 2 will count 4

CodePudding user response：

I think you may be able to do this in one pass through all those posts rows. Here's a suggestion.

I assume your post_date column is some kind of number, not a DATETIME or TIMESTAMP.

First, we need an expression to determine how many weeks ago each post occurred. That's this. It yields an integer. 0 for posts less than a week old, 1 for posts between one and two weeks old etc.

TRUNCATE( (UNIX_TIMESTAMP() - date_time ) / 608400, 0)

Next we need a subquery to count posts and users by week.

    SELECT 
     COUNT (*) post_count,
     user_id,
     TRUNCATE((UNIX_TIMESTAMP() - date_time ) / 608400, 0) age
    FROM posts
   WHERE date_time >= UNIX_TIMESTAMP() - (4 * 608400)
     AND date_time <= UNIX_TIMESTAMP()
    GROUP BY user_id, age

This result set contains one row for each user_id and week.

Finally, summarize the subquery and use HAVING to choose user_id values that show up in it four times.

SELECT user_id
  FROM ( /* that subquery goes here */ ) ages
 GROUP BY user_id
HAVING COUNT(*) = 4

This is efficient because it can gather all four weeks' work of data from the posts table in one scan. If you happen to have an index on (date_time, user_id) MySql does it with an index scan and it will be surprisingly fast.

It's also sweet if your requirement changes to, say, six weeks. Change the 4 constants to 6 and you can use the same query.

Putting it together:

SELECT user_id FROM (
  SELECT 
     COUNT (*) post_count,
     user_id,
     TRUNCATE((UNIX_TIMESTAMP() - date_time) / 608400, 0) age
    FROM posts
   WHERE date_time >= UNIX_TIMESTAMP() - (4 * 608400)
     AND date_time <= UNIX_TIMESTAMP()
    GROUP BY user_id, age
   ) ages
 GROUP BY user_id
HAVING COUNT(*) = 4