I am currently working on a project that has 2 very large sql tables Users and UserDocuments having around million and 2-3 millions records respectively. I have a query that will return the count of all the documents that each indvidual user has uploaded provided the document is not rejected. A user can have multiple documents against his/her id. My current query:-
SELECT
u.user_id,
u.name,
u.date_registered,
u.phone_no,
t1.docs_count,
t1.last_uploaded_on
FROM
Users u
JOIN(
SELECT user_id,
MAX(updated_at) AS last_uploaded_on,
SUM(CASE WHEN STATUS != 2 THEN 1 ELSE 0 END) AS docs_count
FROM
UserDocuments
WHERE
user_id IN(
SELECT
user_id
FROM
Users
WHERE
region_id = 1 AND city_id = 8 AND user_type = 1 AND user_suspended = 0 AND is_enabled = 1 AND verification_status = -1
) AND document_id IN('1', '2', '3', '4', '10', '11')
GROUP BY
user_id
ORDER BY
user_id ASC
) t1
ON
u.user_id = t1.user_id
WHERE
docs_count < 6 AND region_id = 1 AND city_id = 8 AND user_type = 1 AND user_suspended = 0 AND is_enabled = 1 AND verification_status = -1
LIMIT 1000, 100
Currently the query is taking very long around 20 secs to return data with indexes. can someone suggest some tweaks in the follwing query to gain some more preformance out of it.
CodePudding user response:
SELECT
u.user_id,
max( u.name ) name,
max( u.date_registered ) date_registered,
max( phone_no ) phone_no,
MAX(d.updated_at) last_uploaded_on,
SUM(CASE WHEN d.STATUS != 2
THEN 1 ELSE 0 END) docs_count
FROM
Users u
JOIN UserDocuments d
ON u.user_id = d.user_id
AND d.document_id IN ('1', '2', '3', '4', '10', '11')
WHERE
u.region_id = 1
AND u.city_id = 8
AND u.user_type = 1
AND u.user_suspended = 0
AND u.is_enabled = 1
AND u.verification_status = -1
GROUP BY
u.user_id
HAVING
SUM(CASE WHEN d.STATUS != 2
THEN 1 ELSE 0 END) < 6
ORDER BY
u.user_id ASC
LIMIT
1000, 100
Have indexes on your tables as
user ( region_id, city_id, user_type, user_suspended, is_enabled, verification_status )
UserDocuments ( user_id, document_id, status, updated_at )
You are adding extra querying from the user table to both the inner and outer joins which might be killing it. Having an index on your critical "WHERE" components by user will pre-filter that set out. Only from that will it join to the UserDocuments table. By having the outer query get the counts() at the top level query.
Since the users name, registered and phone dont change per user, applying max() to each respectively prevents the need of adding those columns to the group by clause.
The index on the documents table on only the columns needed to confirm status and document_id and when last updated. This prevents the engine from having to go to the raw data pages as it can get the qualifying details directly from the index parts saving you time too.
CodePudding user response:
LIMIT
withoutORDER BY
does not make sense.- An
ORDER BY
in a 'derived table' is ignored. - Will you really have thousands of result rows? (I see the "offset of 1000".)
- Use
JOIN
instead ofIN ( SELECT ... )
- What indexes do you have? I suggest
INDEX(region_id, city_id, user_id)
CASE WHEN d.STATUS != 2 THEN 1 ELSE 0 END
can be shortened tod.status != 2
.- How many different values of
status
are there? If only two, then flip the test to
d.status = 1`.