Home > Software design >  Improving performance of a full outer join in Redshift
Improving performance of a full outer join in Redshift

Time:01-20

I need to complete a full outer join using two columns: date_local and hour_local but am concerned about query performance when using an outer join as opposed to another type of join.

Below is the query using outer join:

SELECT *
FROM TABLE_A
FULL OUTER JOIN TABLE_B
USING (DATE_LOCAL, HOUR_LOCAL)

Would the following query perform better than the query above:

WITH JOIN_VALS AS
         (SELECT DATE_LOCAL
               , HOUR_LOCAL
          FRΩM TABLE_A
          UNION
          SELECT
              DATE_LOCAL
            , HOUR_LOCAL
          FROM TABLE_B
          )
SELECT
    JV.DATE
  , JV.HOUR_LOCAL
  , TA.PLANNED
  , TB.ACTUAL
FROM JOIN_VALS JV
LEFT JOIN TABLE_A TA 
    ON JV.DATE = TA.DATE 
    AND JV.HOUR_LOCAL = TA.HOUR_LOCAL
LEFT JOIN TABLE_B TB 
    ON JV.DATE = TB.DATE 
    AND JV.HOUR_LOCAL = TB.HOUR_LOCAL;

Wondering if I get any performance improvements but isolating the unique join values first, rather than finding them during the outer join.

CodePudding user response:

UNION can be expensive and I don’t think you will seen any benefit from this construct in Redshift. Likely performance loss. Redshift is a columnar database and will see no benefit from peeling off these columns.

The big cost will be if the matches between the two tables on these two columns will be many-to-many. This would lead to additional row creation which could lead to slow performance.

  • Related