Home > Enterprise >  SQL - Avoid full scans when joining archive tables
SQL - Avoid full scans when joining archive tables

Time:03-22

I'm having some performance issues due to full scans being run on some larger tables for a report. I've narrowed things down to this section of the query but can't figure out how to avoid the scans without changing the results.

To explain, we have a data archiving system that copies data from the live table to the archive table daily. The data is not removed from the live table until a period of time has passed. This results in a state where the live table and archive table will both have the same rows, but the data in the rows may not match.

This rules out a UNION query (which would eliminate the full scans). The requirements are for the report to show live data, so I also can't query just the archive table.

Any ideas? Here is the query. The primary keys of both tables is DetailIdent, but I do have an index on OrderIdent, as it's a foreign key back to the parent table. You can see that we take the main table results if they exist, otherwise we fall back to the archive data.

SELECT COALESCE(RegOD.OrderIdent, ArcOD.OrderIdent) AS OrderIdent,
                   COALESCE(RegOD.Quantity, ArcOD.Quantity) AS Quantity,
                   COALESCE(RegOD.LoadQuan, ArcOD.LoadQuan) AS LoadQuan,
                   COALESCE(RegOD.ShipQuan, ArcOD.ShipQuan) AS ShipQuan,
                   COALESCE(RegOD.RcvdQuan, ArcOD.RcvdQuan) AS RcvdQuan,
                   COALESCE(RegOD.UOM, ArcOD.UOM) AS UOM,
                   COALESCE(RegOD.SkidType, ArcOD.SkidType) AS SkidType,
                   COALESCE(RegOD.Product, ArcOD.Product) AS Product,
                   COALESCE(RegOD.PkgCode, ArcOD.PkgCode) AS PkgCode
            FROM OrderDetail RegOD
                FULL JOIN dbo.ArcOrderDtl ArcOD
                    ON ArcOD.DetailIdent = RegOD.DetailIdent
                    WHERE COALESCE(RegOD.OrderIdent, ArcOD.OrderIdent) = 717010

execution plan showing two clustered index full scans

CodePudding user response:

The filtering predicate COALESCE(RegOD.OrderIdent,ArcOD.OrderIdent) = 717010 is killing performance and it's forcing the engine to perform a full scan first, and filter data later.

Option 1 - Rephrase the COALESCE() function

Rephrase the COALESCE() function and let the engine do its work. With a bit of luck the engine will be smart enough to find the optimization. In this case the query can take the form:

SELECT
  COALESCE(RegOD.OrderIdent,ArcOD.OrderIdent) AS OrderIdent,
  COALESCE(RegOD.Quantity,ArcOD.Quantity) AS Quantity,
  COALESCE(RegOD.LoadQuan,ArcOD.LoadQuan) AS LoadQuan,
  COALESCE(RegOD.ShipQuan,ArcOD.ShipQuan) AS ShipQuan,
  COALESCE(RegOD.RcvdQuan,ArcOD.RcvdQuan) AS RcvdQuan,
  COALESCE(RegOD.UOM,ArcOD.UOM) AS UOM,
  COALESCE(RegOD.SkidType,ArcOD.SkidType) AS SkidType,
  COALESCE(RegOD.Product,ArcOD.Product) AS Product,
  COALESCE(RegOD.PkgCode,ArcOD.PkgCode) AS PkgCode
FROM OrderDetail RegOD 
FULL JOIN dbo.ArcOrderDtl ArcOD ON ArcOD.DetailIdent = RegOD.DetailIdent
WHERE RegOD.OrderIdent = 717010 or ArcOD.OrderIdent = 717010

Option 2 - Combine a left join with a right anti-join instead of using a full join

If the engine doesn't optimize Option #1 above, you can still try combining a left join with a right anti-join instead of writing a full join (they are equivalent). It's definitively more verbose, but in this case it clearly shows the engine what to do. This query could look like:

SELECT -- left join here
  COALESCE(RegOD.OrderIdent,ArcOD.OrderIdent) AS OrderIdent,
  COALESCE(RegOD.Quantity,ArcOD.Quantity) AS Quantity,
  COALESCE(RegOD.LoadQuan,ArcOD.LoadQuan) AS LoadQuan,
  COALESCE(RegOD.ShipQuan,ArcOD.ShipQuan) AS ShipQuan,
  COALESCE(RegOD.RcvdQuan,ArcOD.RcvdQuan) AS RcvdQuan,
  COALESCE(RegOD.UOM,ArcOD.UOM) AS UOM,
  COALESCE(RegOD.SkidType,ArcOD.SkidType) AS SkidType,
  COALESCE(RegOD.Product,ArcOD.Product) AS Product,
  COALESCE(RegOD.PkgCode,ArcOD.PkgCode) AS PkgCode
FROM OrderDetail RegOD 
LEFT JOIN dbo.ArcOrderDtl ArcOD ON ArcOD.DetailIdent = RegOD.DetailIdent
WHERE RegOD.OrderIdent = 717010
UNION ALL
SELECT -- right anti-join here
  OrderIdent,
  Quantity,
  LoadQuan,
  ShipQuan,
  RcvdQuan,
  UOM,
  SkidType,
  Product,
  PkgCode
FROM dbo.ArcOrderDtl ArcOD
LEFT JOIN OrderDetail RegOD ON ArcOD.DetailIdent = RegOD.DetailIdent
WHERE ArcOD.OrderIdent = 717010 and RegOD.DetailIdent IS NULL

CodePudding user response:

You want all rows for an OrderIdent, but rows (identified by DetailIdent) can be either in OrderDetail or in ArcOrderDtl or both. You want to give precedence to the OrderDetail rows if they exist.

One idea hence is to select all rows and then rank them, giving OrderDetail a better rank than ArcOrderDtl, Use TOP WITH TIES then to get all the better ranked rows and dismiss the others.

SELECT TOP(1) WITH TIES
  OrderIdent, Quantity, LoadQuan, ShipQuan, RcvdQuan, UOM, SkidType, Product, PkgCode
FROM
(
  SELECT 
    DetailIdent, OrderIdent, Quantity, LoadQuan, ShipQuan, RcvdQuan, UOM, SkidType,
    Product, PkgCode, 1 AS priority
  FROM OrderDetail
  WHERE OrderIdent = 717010
  UNION ALL
  SELECT 
    DetailIdent, OrderIdent, Quantity, LoadQuan, ShipQuan, RcvdQuan, UOM, SkidType,
    Product, PkgCode, 2 AS priority
  FROM dbo.ArcOrderDtl
  WHERE OrderIdent = 717010
) unioned
ORDER BY RANK() (PARTITION BY DetailIdent ORDER BY priority);

CodePudding user response:

According to your definition there is no need for FULL JOIN, UNION ALL, COALESCE. If DetailIdent is primary key and is in Clustered index you can use this feature. As you said: OrderDetail has priority and then ArcOrderDetail.

DECLARE @last_not_deleted INT
DECLARE @order_ident INT = 717010
SELECT @last_not_deleted = min(DetailIdent) FROM OrderDetail

SELECT
  OrderIdent,
  Quantity,
  LoadQuan,
  ShipQuan,
  RcvdQuan,
  UOM,
  SkidType,
  Product,
  PkgCode
FROM OrderDetail
WHERE OrderIdent = @order_ident
  UNION 
SELECT
  OrderIdent,
  Quantity,
  LoadQuan,
  ShipQuan,
  RcvdQuan,
  UOM,
  SkidType,
  Product,
  PkgCode
FROM ArchOrderDetail
WHERE OrderIdent = @order_ident
AND DetailIdent < @last_not_deleted
  1. I have selected last not deleted DetailIdent from OrderDetail.
  2. I have selected all valid rows from live table without JOIN, FULL JOIN, LEFT JOIN.
  3. These data (from both tables) can be simply UNION, becasue you know that rows from Archive table are only in Archive table.

This logic will masively improve your performance. And there is simple transformation into SELECT valid rows of multiple events:

DECLARE @last_not_deleted INT
DECLARE @order_ident INT = 717010
SELECT @last_not_deleted = min(DetailIdent) FROM OrderDetail

SELECT
  OrderIdent,
  Quantity,
  LoadQuan,
  ShipQuan,
  RcvdQuan,
  UOM,
  SkidType,
  Product,
  PkgCode
FROM OrderDetail as b
JOIN #order_idents as oi on b.OrderIdent  = oi.OrderIdent 
  UNION 
SELECT
  OrderIdent,
  Quantity,
  LoadQuan,
  ShipQuan,
  RcvdQuan,
  UOM,
  SkidType,
  Product,
  PkgCode
FROM ArchOrderDetail
JOIN #order_idents as oi on b.OrderIdent  = oi.OrderIdent 
AND DetailIdent < @last_not_deleted

Where #order_idents is table with one integer column:

OrderIdent
717010
717011
...
717050

CodePudding user response:

Try to include all column in your index, something like:

CREATE NONCLUSTERED INDEX IX_OrderDetail_OrderIdent
ON OrderDetail (OrderIdent)
INCLUDE (—ALL COLUMN IN SELECT)
GO

CREATE NONCLUSTERED INDEX IX_OrderDtl_OrderIdent
ON OrderDtl (OrderIdent)
INCLUDE (—ALL COLUMN IN SELECT)
GO
  • Related