I am looking for some help in regards to optimising the following query due to slow performance. The tables and records do not exist on my machine (helping a friend) hence I do not have the insert records and all I know is that the tables are large and contain millions of rows of data. I have been sent the create statements in order to show the table relationships with the foreign keys etc.
The following is the query:
SELECT
tutor.tutor_id,
tutor.subject_id,
course.title,
SUBSTR(score.date, 4, 2) AS month,
AVG(score.score)
FROM
tutor
LEFT JOIN course ON tutor.course_id = course.course_id
LEFT JOIN score ON tutor.course_id = score.course_id
WHERE
course.course_id IS NOT NULL
AND score.date > '01-MAR-2020'
AND score.student_id in (SELECT student_id FROM students)
GROUP BY
tutor.tutor_id,
tutor.subject_id,
course.title,
SUBSTR(score.date, 4, 2)
;
Having looked through the above query, I understand that, there is a sub select query within the where clause that could be slowing the query down and this would be first step to changeover to an INNER JOIN.
I am trying to think of what else I could do, to quicken the query such as selecting the fields (which it already has). Only other thing would be to add a LIMIT to sample query results (can't run this to test myself) and also add an EXPLAIN PLAN to understand what's causing the delays in the results coming back so slow. The other option is to check if the columns in the joins have been indexed.
CREATE STATEMENTS
CREATE TABLE Score (
score_id integer,
student_id integer,
course_id integer,
date date,
score integer,
PRIMARY KEY(score_id),
FOREIGN KEY(student_id) REFERENCES Students(student_id),
FOREIGN KEY(course_id) REFERENCES Course(course_id)
);
CREATE TABLE Student (
student_id integer,
first_name varchar,
last_name varchar,
group_id integer,
PRIMARY KEY(student_id),
FOREIGN KEY(group_id) REFERENCES Groups(group_id)
);
CREATE TABLE Groups (
group_id integer,
name varchar,
PRIMARY KEY(group_id)
);
CREATE TABLE Course (
course_id integer,
title varchar,
PRIMARY KEY(course_id)
);
CREATE TABLE Tutor (
course_id integer,
tutor_id integer,
group_id integer,
PRIMARY KEY(tutor_id),
FOREIGN KEY(course_id) REFERENCES Course(course_id),
FOREIGN KEY(group_id) REFERENCES Groups(group_id)
);
I would appreciate it, if somebody could help me understand, if I have missed anything or if I am along the right lines in trying to optimise the query that I have.
CodePudding user response:
- You query does not use
LEFT OUTER JOIN
as, in theWHERE
filter,score.date > '01-MAR-2020'
andcourse.course_id IS NOT NULL
can never matchNULL
values so theOUTER
joined rows will be eliminated. - Given that then
WHERE course.course_id IS NOT NULL
is redundant as the join clauseON tutor.course_id = course.course_id
also enforces theNOT NULL
condition. '01-MAR-2020'
is not a date; it is a string literal that looks like a date and relies on the SQL engine performing an implicit cast from string-to-date. UseDATE '2020-03-01'
instead.score.student_id in (SELECT student_id FROM students)
is enforced by theFOREIGN KEY
constraint. You don't need to query thestudents
table, you only need to check that the value isNOT NULL
.SUBSTR(score.date, 4, 2)
relies on an implicit date-to-string conversion. Either useEXTRACT(MONTH FROM score.date)
to directly get the month number (if you want to group 2022-03 with 2020-03) or useTRUNC(score.date, 'MM')
(if 2020-03 should not be grouped with 2021-03).
Which makes your query:
SELECT tutor.tutor_id,
tutor.subject_id,
course.title,
TRUNC(score.date, 'MM') AS month,
AVG(score.score)
FROM tutor
INNER JOIN course ON tutor.course_id = course.course_id
INNER JOIN score ON tutor.course_id = score.course_id
WHERE score.date > DATE '2020-03-01'
AND score.student_id IS NOT NULL
GROUP BY
tutor.tutor_id,
tutor.subject_id,
course.title,
TRUNC(score.date, 'MM');
As for optimisations:
You can try aggregating the score before joining the other tables:
SELECT t.tutor_id,
t.subject_id,
c.title,
s.month,
s.avg_score
FROM ( SELECT course_id,
TRUNC(date, 'MM') AS month,
AVG(score) AS avg_score
FROM score
WHERE date > DATE '2020-03-01'
AND student_id IS NOT NULL
GROUP BY
course_id,
TRUNC(date, 'MM')
) s
INNER JOIN tutor t ON t.course_id = s.course_id
INNER JOIN course c ON c.course_id = s.course_id
This means that you are aggregating by the primary key course_id
and not by course.title
. But that is probably a good thing as aggregating by title would combine two different courses that happen to have the same name; which is probably not what you want.