Can someone optimize this SQL query?-CodePudding

I am currently working on a project that has 2 very large sql tables Users and UserDocuments having around million and 2-3 millions records respectively. I have a query that will return the count of all the documents that each indvidual user has uploaded provided the document is not rejected. A user can have multiple documents against his/her id. My current query:-

SELECT
    u.user_id,
    u.name,
    u.date_registered,
    u.phone_no,
    t1.docs_count,
    t1.last_uploaded_on
FROM
    Users u
JOIN(
    SELECT user_id,
        MAX(updated_at) AS last_uploaded_on,
        SUM(CASE WHEN STATUS != 2 THEN 1 ELSE 0 END) AS docs_count
FROM
    UserDocuments
WHERE
    user_id IN(
    SELECT
        user_id
    FROM
        Users
    WHERE
        region_id = 1 AND city_id = 8 AND user_type = 1 AND user_suspended = 0 AND is_enabled = 1 AND verification_status = -1
) AND document_id IN('1', '2', '3', '4', '10', '11')
GROUP BY
    user_id
ORDER BY
    user_id ASC
) t1
ON
    u.user_id = t1.user_id
WHERE
    docs_count < 6 AND region_id = 1 AND city_id = 8 AND user_type = 1 AND user_suspended = 0 AND is_enabled = 1 AND verification_status = -1
LIMIT 1000, 100

Currently the query is taking very long around 20 secs to return data with indexes. can someone suggest some tweaks in the follwing query to gain some more preformance out of it.

CodePudding user response：

SELECT
        u.user_id,
        max( u.name ) name,
        max( u.date_registered ) date_registered,
        max( phone_no ) phone_no,
        MAX(d.updated_at) last_uploaded_on,
        SUM(CASE WHEN d.STATUS != 2 
                THEN 1 ELSE 0 END) docs_count
    FROM
        Users u
            JOIN UserDocuments d
                ON u.user_id = d.user_id
                AND d.document_id IN ('1', '2', '3', '4', '10', '11')
    WHERE
            u.region_id = 1 
        AND u.city_id = 8 
        AND u.user_type = 1 
        AND u.user_suspended = 0 
        AND u.is_enabled = 1 
        AND u.verification_status = -1
    GROUP BY
        u.user_id
    HAVING
        SUM(CASE WHEN d.STATUS != 2 
                THEN 1 ELSE 0 END) < 6
    ORDER BY
        u.user_id ASC
    LIMIT 
        1000, 100

Have indexes on your tables as

user ( region_id, city_id, user_type, user_suspended, is_enabled, verification_status )
UserDocuments ( user_id, document_id, status, updated_at )

You are adding extra querying from the user table to both the inner and outer joins which might be killing it. Having an index on your critical "WHERE" components by user will pre-filter that set out. Only from that will it join to the UserDocuments table. By having the outer query get the counts() at the top level query.

Since the users name, registered and phone dont change per user, applying max() to each respectively prevents the need of adding those columns to the group by clause.

The index on the documents table on only the columns needed to confirm status and document_id and when last updated. This prevents the engine from having to go to the raw data pages as it can get the qualifying details directly from the index parts saving you time too.

CodePudding user response：

LIMIT without ORDER BY does not make sense.
An ORDER BY in a 'derived table' is ignored.
Will you really have thousands of result rows? (I see the "offset of 1000".)
Use JOIN instead of IN ( SELECT ... )
What indexes do you have? I suggest INDEX(region_id, city_id, user_id)
CASE WHEN d.STATUS != 2 THEN 1 ELSE 0 END can be shortened to d.status != 2.
How many different values of status are there? If only two, then flip the test to d.status = 1`.