Home > OS >  How to increase performance of a query against data vault architecture when joins with same select q
How to increase performance of a query against data vault architecture when joins with same select q

Time:06-22

We need to get the minimum and maximum dates over common location, project and form.

the pseudo-code of such query looks like:

SELECT A.FORM, 
A.LOCATION,
A.PROJECT,
MN.START_DATE,
MX.END_DATE

FROM

(
) A

JOIN (
SELECT ...
FROM 
-- JOINS ON N TABLES

) MN ON MN.FORM = A.FORM AND MN.LOCATION = A.LOCATION AND MN.PROJECT = A.PROJECT

JOIN (
-- SAME JOINS ON SAME N TABLES

) MX ON MX.FORM = A.FORM AND MX.LOCATION = A.LOCATION AND MX.PROJECT = A.PROJECT

query of MN and MX are exactly the same but the date field whereas MIN(DATE) is coming from MN, and MAX(DATE) is coming from MX join.

is it possible to combine both MN into MX in a way to reduce execution time and increase performance as this is running over thousands of rows of data.

here is the exact query:

SELECT 
MX.ACTIVITY_CODE,
MN.FORM_ID_STRING, 
MX.PROJECT_CODE, 
DATE(MN.MIN_START_DATE) AS MIN_START_DATE, 
DATE(MX.MAX_END_DATE) AS MAX_END_DATE

FROM
(
  -- Getting the min start date of submissions related to same location, project and activity
  SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
  MIN(SO.OBSERVATION_START_DT) MIN_START_DATE
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
  WHERE 
  SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
  AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)

  AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
  GROUP BY  HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MN
JOIN
(
  -- Getting the max end date of submissions related to same location, project and activity
  SELECT HO.FORM_ID_STRING as form, HO.LOCATION_NAME as loc, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
  MAX(SO.OBSERVATION_START_DT) as MAX_END_DATE
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
  WHERE 
//  HO.FORM_ID_STRING = 'Global_rh_skilled_facility_delivery_v1_0_1_data.csv'
  SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
  AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)

  AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
  GROUP BY  HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MX 
ON MX.form = MN.form_id_string AND MX.loc = MN.location_name AND MN.PROJECT_CODE = MX.PROJECT_CODE

CodePudding user response:

@HoneyBadger is correct. You can fetch both the MIN and MAX start dates in the same subquery, so you don't need to repeat it. That should save some time, IO, and CPU.

SELECT ...,
       MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,
       MAX(SO.OBSERVATION_START_DT) MAX_END_DATE

And, you have a couple of correlated subqueries in the WHERE clause of your subquery. You can change both of those to a single JOINed subquery and save some more time.

This is what your single big subquery might look like with those two changes. I cannot debug this because I don't have your data and can only guess at your spec.

SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
       MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,  /* min and max in */
       MAX(SO.OBSERVATION_START_DT) MAX_END_DATE     /* just one query */
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN (        /* non-correlated subquery to find max LOAD_DT values */
                SELECT OBSERVATION_HKEY, MAX(LOAD_DT) LOAD_DT
                  FROM DATA_VAULT.SAT_OBSERVATION
                 GROUP BY OBSERVATION_HKEY
   ) SOMAX   ON SO.LOAD_DT = SOMAX.LOAD_DT
            AND SO.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
            AND SOR.LOAD_DT = SOMAX.LOAD_DT
            AND SOR.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
 WHERE SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
 GROUP BY HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE

The next step, if this doesn't give you enough performance, is to look at the execution plan (the EXPLAIN output) and try to work out whether some more indexes or different indexes will help. Searching for MIN and MAX values can be astonishingly fast with the right indexes.

That is a good subject for another question.

CodePudding user response:

Get the max and min without all those joins, if possible.

Get the max and min in the same query, if possible.

Get MAX(LOAD_DT) once, not twice.

Then JOIN to all the other tables.

Some of these indexes may be useful:

HO:  INDEX(FORM_ID_STRING, OBSERVATION_HKEY,  LOCATION_NAME)
HPC:  INDEX(PROJECT_CODE_HKEY,  PROJECT_CODE)
HA:  INDEX(ACTIVITY_HKEY,  ACTIVITY_CODE)
SO:  INDEX(LOAD_DT, OBSERVATION_HKEY,  OBSERVATION_START_DT)
SOR:  INDEX(LOAD_DT, REVIEW_STATUS, OBSERVATION_HKEY)
LOPC:  INDEX(OBSERVATION_HKEY,  PROJECT_CODE_HKEY)
LOCA:  INDEX(OBSERVATION_HKEY,  ACTIVITY_HKEY)
HO:  INDEX(OBSERVATION_HKEY)

For further discussion, please provide SHOW CREATE TABLE for all the tables; it sounds like they might be over-normalized.

  • Related