I am trying to improve the performance of a query using a "materialized view" to optimize away joins. The first query below is the original, which employs joins. The second is the query written against a table i generated which includes all the joined data (the equivalent of a materialized view). They both return the same result set. Unfortunalatey, somehow, the second query is MUCH slower when handling a very long set of input ids (the IN clause). I don't understand how that could be!!!! Executing all the joins has to have a fair amount of overheat that is saved by the "materialized view", right?
SELECT
clinical_sample.INTERNAL_ID AS "internalId",
sample.STABLE_ID AS "sampleId",
patient.STABLE_ID AS "patientId",
clinical_sample.ATTR_ID AS "attrId",
cancer_study.CANCER_STUDY_IDENTIFIER AS "studyId",
clinical_sample.ATTR_VALUE AS "attrValue"
FROM clinical_sample
INNER JOIN sample ON clinical_sample.INTERNAL_ID = sample.INTERNAL_ID
INNER JOIN patient ON sample.PATIENT_ID = patient.INTERNAL_ID
INNER JOIN cancer_study ON patient.CANCER_STUDY_ID =
cancer_study.CANCER_STUDY_ID
WHERE cancer_study.CANCER_STUDY_IDENTIFIER = 'xxxxx'
AND sample.STABLE_ID IN
('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND clinical_sample.ATTR_ID IN
(
'CANCER_TYPE'
);
SELECT
internalId,
sampleId,
patientId,
attrId,
studyId,
attrValue
FROM test
WHERE
sampleId IN ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND studyId = 'xxxxx'
AND attrId = 'CANCER_TYPE';
Update: I did notice in Workbench report that the query with joins seems to scan far fewer rows. About 829k vs ~2400k for the second, joinless query. So having joins seems to actually be a major optimization somehow. I have index in sampleId, studyId, attrId and composite of all three.
Both table "test" and "clinical_sample" have the same number of rows.
CodePudding user response:
It would help to see what the PRIMARY KEY
of each table is.
Some of these indexes are likely to help:
clinical_sample: INDEX(ATTR_ID, INTERNAL_ID, ATTR_VALUE)
sample: INDEX(STABLE_ID, INTERNAL_ID, PATIENT_ID)
patient: INDEX(INTERNAL_ID, STABLE_ID, CANCER_STUDY_ID)
cancer_study: INDEX(CANCER_STUDY_IDENTIFIER, CANCER_STUDY_ID)
I agree with Barmar's INDEX(studyId, attrId, sampleId)
for the materialized view.
I have index in sampleId, studyId, attrId and composite of all three.
Let's see the EXPLAIN
. It may show that it is using your index just on (sampleId)
when it should be using the composite index.
Also put the IN
column last, not first, regardless of cardinality. More precisely, put =
columns first in a composite index.
CodePudding user response:
Food for thought: When and why are database joins expensive? this leads me to believe that normalized tables with indexes could actually be faster than my denormalized attempt (materialized view).