We need to get the minimum and maximum dates over common location, project and form.
the pseudo-code of such query looks like:
SELECT A.FORM,
A.LOCATION,
A.PROJECT,
MN.START_DATE,
MX.END_DATE
FROM
(
) A
JOIN (
SELECT ...
FROM
-- JOINS ON N TABLES
) MN ON MN.FORM = A.FORM AND MN.LOCATION = A.LOCATION AND MN.PROJECT = A.PROJECT
JOIN (
-- SAME JOINS ON SAME N TABLES
) MX ON MX.FORM = A.FORM AND MX.LOCATION = A.LOCATION AND MX.PROJECT = A.PROJECT
query of MN and MX are exactly the same but the date field whereas MIN(DATE) is coming from MN, and MAX(DATE) is coming from MX join.
is it possible to combine both MN into MX in a way to reduce execution time and increase performance as this is running over thousands of rows of data.
here is the exact query:
SELECT
MX.ACTIVITY_CODE,
MN.FORM_ID_STRING,
MX.PROJECT_CODE,
DATE(MN.MIN_START_DATE) AS MIN_START_DATE,
DATE(MX.MAX_END_DATE) AS MAX_END_DATE
FROM
(
-- Getting the min start date of submissions related to same location, project and activity
SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
MIN(SO.OBSERVATION_START_DT) MIN_START_DATE
FROM DATA_VAULT.HUB_OBSERVATION HO
JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
WHERE
SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
GROUP BY HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MN
JOIN
(
-- Getting the max end date of submissions related to same location, project and activity
SELECT HO.FORM_ID_STRING as form, HO.LOCATION_NAME as loc, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
MAX(SO.OBSERVATION_START_DT) as MAX_END_DATE
FROM DATA_VAULT.HUB_OBSERVATION HO
JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
WHERE
// HO.FORM_ID_STRING = 'Global_rh_skilled_facility_delivery_v1_0_1_data.csv'
SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
GROUP BY HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MX
ON MX.form = MN.form_id_string AND MX.loc = MN.location_name AND MN.PROJECT_CODE = MX.PROJECT_CODE
CodePudding user response:
@HoneyBadger is correct. You can fetch both the MIN and MAX start dates in the same subquery, so you don't need to repeat it. That should save some time, IO, and CPU.
SELECT ...,
MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,
MAX(SO.OBSERVATION_START_DT) MAX_END_DATE
And, you have a couple of correlated subqueries in the WHERE clause of your subquery. You can change both of those to a single JOINed subquery and save some more time.
This is what your single big subquery might look like with those two changes. I cannot debug this because I don't have your data and can only guess at your spec.
SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
MIN(SO.OBSERVATION_START_DT) MIN_START_DATE, /* min and max in */
MAX(SO.OBSERVATION_START_DT) MAX_END_DATE /* just one query */
FROM DATA_VAULT.HUB_OBSERVATION HO
JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN ( /* non-correlated subquery to find max LOAD_DT values */
SELECT OBSERVATION_HKEY, MAX(LOAD_DT) LOAD_DT
FROM DATA_VAULT.SAT_OBSERVATION
GROUP BY OBSERVATION_HKEY
) SOMAX ON SO.LOAD_DT = SOMAX.LOAD_DT
AND SO.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
AND SOR.LOAD_DT = SOMAX.LOAD_DT
AND SOR.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
WHERE SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
GROUP BY HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
The next step, if this doesn't give you enough performance, is to look at the execution plan (the EXPLAIN output) and try to work out whether some more indexes or different indexes will help. Searching for MIN and MAX values can be astonishingly fast with the right indexes.
That is a good subject for another question.
CodePudding user response:
Get the max and min without all those joins, if possible.
Get the max and min in the same query, if possible.
Get MAX(LOAD_DT) once, not twice.
Then JOIN to all the other tables.
Some of these indexes may be useful:
HO: INDEX(FORM_ID_STRING, OBSERVATION_HKEY, LOCATION_NAME)
HPC: INDEX(PROJECT_CODE_HKEY, PROJECT_CODE)
HA: INDEX(ACTIVITY_HKEY, ACTIVITY_CODE)
SO: INDEX(LOAD_DT, OBSERVATION_HKEY, OBSERVATION_START_DT)
SOR: INDEX(LOAD_DT, REVIEW_STATUS, OBSERVATION_HKEY)
LOPC: INDEX(OBSERVATION_HKEY, PROJECT_CODE_HKEY)
LOCA: INDEX(OBSERVATION_HKEY, ACTIVITY_HKEY)
HO: INDEX(OBSERVATION_HKEY)
For further discussion, please provide SHOW CREATE TABLE
for all the tables; it sounds like they might be over-normalized.