I am having performance issues with a query, I have 21 million records across the table, and 2 of the tables I'm looking in here have 8 million each; individually, they are very quick. But I've done a query that, in my opinion, isn't very good, but it's the only way I know how to do it.
This query takes 65 seconds, I need to get it under 1 second and I think it's possible if I don't have all the SELECT queries, but once again, I am not sure how else to do it with my SQL knowledge.
Database server version is MariaDB 10.6.
SELECT
pa.`slug`,
(
SELECT
SUM(`impressions`)
FROM `rh_pages_gsc_country`
WHERE `page_id` = pa.`page_id`
AND `country` = 'aus'
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as au_impressions,
(
SELECT
SUM(`clicks`)
FROM `rh_pages_gsc_country`
WHERE `page_id` = pa.`page_id`
AND `country` = 'aus'
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as au_clicks,
(
SELECT
COUNT(`keywords_id`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as keywords,
(
SELECT
AVG(`position`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as avg_pos,
(
SELECT
AVG(`ctr`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as avg_ctr
FROM `rh_pages` pa
WHERE pa.`site_id` = 13
ORDER BY au_impressions DESC, keywords DESC, slug DESC
If anyone can help, I don't think the table structure is needed here as it's basically shown in the query, but here is a photo of the constraints and table types.
Anyone that can help is greatly appreciated.
CodePudding user response:
Your query is aggregating (summarizing) rows from two different detail tables, rh_pages_gsc_country
and rh_pages_gsc_keywords
, and doing so for a particular date range. And it has a lot of correlated subqueries.
The first steps in your path to better performance are
- Converting your correlated subqueries to independent subqueries, then JOINing them.
- Writing one subquery for each detail table, rather than one for each column you need summarized.
You mentioned you've been struggling with this. The concept I hope you learn from this answer is this: you can often refactor away your correlated subqueries if you can come up with independent subqueries that give the same results, and then join them together. If you mention subqueries in your SELECT clause -- SELECT ... (SELECT whatever) whatever ...
-- you probably have an opportunity to do this refactoring.
Here goes. First you need a subquery for your date range. You have this one right, just repeated.
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
Next you need a subquery for rh_pages_gsc_country
. It is a modification of what you have. We'll fetch both SUMs in one subquery.
SELECT SUM(`impressions`) impressions,
SUM(`clicks`) clicks,
page_id, date_id
FROM `rh_pages_gsc_country`
WHERE `country` = 'aus'
GROUP BY page_id, date_id
See how this goes? This subquery yields a virtual table with exactly one row for every combination of page_id
and date_id
, containing the number of impressions and the number of clicks.
Next, let's join the subqueries together in a main query. This yields some columns of your result set.
SELECT pa.slug, country.impressions, country.clicks
FROM rh_pages pa
JOIN (
SELECT SUM(`impressions`) impressions,
SUM(`clicks`) clicks,
page_id, date_id
FROM `rh_pages_gsc_country`
WHERE `country` = 'aus' -- constant for country code
GROUP BY page_id, date_id
) country ON country.page_id = pa.page_id
JOIN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
) dates ON dates.date_id = country.date_id
WHERE pa.site_id = 13 -- constant for page id
ORDER BY country.impressions DESC
This runs through the rows of rh_pages_gsc_dates
and rh_pages_gsc_country
just once to satisfy your query. So, faster.
Finally let's do the same thing for your rh_pages_gsc_keywords
table's summary.
SELECT pa.slug, country.impressions, country.clicks,
keywords.keywords, keywords.avg_pos, keywords.avg_ctr
FROM rh_pages pa
JOIN (
SELECT SUM(`impressions`) impressions,
SUM(`clicks`) clicks,
page_id, date_id
FROM `rh_pages_gsc_country`
WHERE `country` = 'aus' -- constant for country code
GROUP BY page_id, date_id
) country ON country.page_id = pa.page_id
JOIN (
SELECT SUM(`keywords_id`) keywords,
AVG(`position`) position,
AVG(`ctr`) avg_ctr,
page_id, date_id
FROM `rh_pages_gsc_keywords`
GROUP BY page_id, date_id
) keywords ON keywords.page_id = pa.page_id
JOIN (
SELECT `date_id`
FROM `rh_pages_gsc_keywords`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
) dates ON dates.date_id = country.date_id
AND dates.date_id = keywords.date_id
WHERE pa.site_id = 13 -- constant for page id
ORDER BY impressions DESC, keywords DESC, slug DESC
This will almost certainly be faster than what you have now. If it's fast enough, great. If not, please don't hesitate to ask another question for help, tagging it query-optimization. We will need to see your table definitions, your index definitions, and the output of EXPLAIN. Please read this before asking a followup question.
I did not, repeat not, debug any of this. That's up to you.
CodePudding user response:
Do NOT normalize any column that will be regularly used in a "range scan", such as date
. The following is terribly slow:
AND `date_id` IN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH
AND NOW() )
It also consumes extra space to have BIGINT
(8 bytes) pointing to a DATE
(5 bytes).
Once you move the date
to the various tables, the subqueries simplify, such as
SELECT AVG(`position`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH
AND NOW() )
becomes
SELECT AVG(`position`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date` >= NOW() - INTERVAL 12 MONTH
I'm assuming that nothing after "NOW" has yet been stored. If there are dates in the future, then add
AND `date` < NOW()
Each table will probably need a new index, such as
INDEX(page_id, date) -- in that order
(Yes, the "JOIN" suggestion by others is a good one. It's essentially orthogonal to my suggestions above and below.)
After you have made those changes, if the performance is not good enough, we can discuss Summary Tables