Leveraging CHECKSUM in MERGE but unable to get all rows to merge

I am having trouble getting MERGE statements to work properly, and I have recently started to try to use checksums.

In the toy example below, I cannot get the row (1, 'ANDREW', 334.3), which is sitting in the staging table, to merge into the main table.

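-- drop leftovers from a previous run (IF EXISTS needs SQL Server 2016 or later)
DROP TABLE IF EXISTS TEMP1;
DROP TABLE IF EXISTS TEMP1_STAGE;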

-- create table
CREATE TABLE TEMP1
(
    [ID] INT,
    [NAME] VARCHAR(55),
    [SALARY] FLOAT,
    [SCD] INT
)

-- create stage
CREATE TABLE TEMP1_STAGE
(
    [ID] INT,
    [NAME] VARCHAR(55),
    [SALARY] FLOAT,
    [SCD] INT
)

-- insert vals into stage
INSERT INTO TEMP1_STAGE (ID, NAME, SALARY) 
VALUES 
(1, 'ANDREW', 333.3),
(2, 'JOHN', 555.3),
(3, 'SARAH', 444.3)

-- insert stage table into main table
INSERT INTO TEMP1
SELECT *
FROM TEMP1_STAGE;

-- clean up stage table
TRUNCATE TABLE TEMP1_STAGE;

-- put some new values in the stage table
INSERT INTO TEMP1_STAGE (ID, NAME, SALARY) 
VALUES 
(1, 'ANDREW', 334.3),
(4, 'CARL', NULL)

-- CHECKSUMS
update  TEMP1_STAGE 
   set  SCD = binary_checksum(ID, NAME, SALARY);

update  TEMP1 
   set  SCD = binary_checksum(ID, NAME, SALARY);

-- run merge
MERGE TEMP1 AS TARGET
USING TEMP1_STAGE AS SOURCE 
-- match
ON (SOURCE.[ID] = TARGET.[ID])
WHEN NOT MATCHED BY TARGET
THEN INSERT (
[ID], [NAME], [SALARY], [SCD]) VALUES (
         SOURCE.[ID], SOURCE.[NAME], SOURCE.[SALARY], SOURCE.[SCD]);

-- the value: (1, 'ANDREW', 334.3) is not merged in
SELECT * FROM TEMP1;

How can I use the checksum to my advantage in the MERGE?

CodePudding user response:

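Your issue is that the NOT MATCHED condition only considers the ID values compared in the ON condition. ID = 1 already exists in TEMP1, so that row counts as matched and the WHEN NOT MATCHED BY TARGET branch never fires for it; only ID = 4 is inserted.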

If you want to keep both versions of a record (same ID, different content), add SCD to the ON condition.
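
As a rough sketch of that first option, assuming the SCD checksums have already been populated on both tables as in the question, the ON clause would compare both columns, so a changed row counts as unmatched and is inserted next to the old one:

```sql
-- Sketch only: a row with the same ID but a different checksum is treated as new,
-- so TEMP1 ends up holding both the old and the new version of ID = 1.
MERGE TEMP1 AS TARGET
USING TEMP1_STAGE AS SOURCE
ON (SOURCE.[ID] = TARGET.[ID] AND SOURCE.[SCD] = TARGET.[SCD])
WHEN NOT MATCHED BY TARGET
THEN INSERT ([ID], [NAME], [SALARY], [SCD])
     VALUES (SOURCE.[ID], SOURCE.[NAME], SOURCE.[SALARY], SOURCE.[SCD]);
```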

If (more likely) your intent is that record ID = 1 be updated with the new 'SALARY', you will need to add a WHEN MATCHED AND SOURCE.SCD <> TARGET.SCD THEN UPDATE ... clause.
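
Spelled out against the tables from the question, that second option might look like the following. This is a sketch that assumes SCD has been refreshed on both tables just before the merge:

```sql
-- Sketch only: update rows whose checksum changed, insert rows that are new.
MERGE TEMP1 AS TARGET
USING TEMP1_STAGE AS SOURCE
ON (SOURCE.[ID] = TARGET.[ID])
WHEN MATCHED AND SOURCE.[SCD] <> TARGET.[SCD]
THEN UPDATE SET
     TARGET.[NAME]   = SOURCE.[NAME],
     TARGET.[SALARY] = SOURCE.[SALARY],
     TARGET.[SCD]    = SOURCE.[SCD]
WHEN NOT MATCHED BY TARGET
THEN INSERT ([ID], [NAME], [SALARY], [SCD])
     VALUES (SOURCE.[ID], SOURCE.[NAME], SOURCE.[SALARY], SOURCE.[SCD]);
```

With the toy data, ID = 1 should pick up the new salary and ID = 4 should be inserted.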

That said, the 32-bit int value returned by the binary_checksum() function is not sufficiently distinct to avoid collisions, and a collision means a changed row is silently skipped. Take a look at HASHBYTES instead. See Binary_Checksum Vs HashBytes function.
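
A minimal sketch of the HASHBYTES idea, computed inline so the INT column SCD does not have to change type. The '|' separator and the choice of columns are my own assumptions, not something the question specifies:

```sql
-- Sketch only: list staged rows whose content differs from the matching target row.
-- SHA2_256 is available from SQL Server 2012; CONVERT style 3 (lossless FLOAT to text)
-- assumes SQL Server 2016 or later.
-- Caveat: CONCAT turns NULL into an empty string, so NULL and '' hash the same here.
SELECT S.[ID]
FROM TEMP1_STAGE AS S
JOIN TEMP1 AS T
  ON T.[ID] = S.[ID]
WHERE HASHBYTES('SHA2_256', CONCAT(S.[NAME], '|', CONVERT(VARCHAR(30), S.[SALARY], 3)))
   <> HASHBYTES('SHA2_256', CONCAT(T.[NAME], '|', CONVERT(VARCHAR(30), T.[SALARY], 3)));
```

To persist the hash instead, the column would need to be VARBINARY(32) rather than INT.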

Even that may not yield your intended performance gain. Assuming that you have to calculate the hash for all records in the staging table for each update cycle, you may find that it is simpler to just compare each potentially different field before the update. Something like:

WHEN MATCHED AND (SOURCE.NAME <> TARGET.NAME OR SOURCE.SALARY <> TARGET.SALARY)
THEN UPDATE ...

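Even then, you need to be careful of potential NULL values and COLLATION. Both NULL <> 50000.00 and 'Andrew' <> 'ANDREW' may not give you the results you expect: the first evaluates to UNKNOWN rather than true, so the row is not updated, and the second is false under a case-insensitive collation. It might be easiest and most reliable to just code WHEN MATCHED THEN UPDATE ...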
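
If you do want to keep the comparison, one commonly used NULL-safe form is the EXISTS ... EXCEPT pattern, because EXCEPT treats two NULLs as equal. A sketch against the question's tables, leaving SCD out entirely:

```sql
-- Sketch only: NULL-safe change detection without a checksum column.
-- EXCEPT returns a row when any compared column differs, counting NULL vs. value
-- as a difference and NULL vs. NULL as equal.
-- For a case-sensitive NAME comparison, add an explicit collation to both sides,
-- e.g. SOURCE.[NAME] COLLATE Latin1_General_CS_AS.
MERGE TEMP1 AS TARGET
USING TEMP1_STAGE AS SOURCE
ON (SOURCE.[ID] = TARGET.[ID])
WHEN MATCHED AND EXISTS (
         SELECT SOURCE.[NAME], SOURCE.[SALARY]
         EXCEPT
         SELECT TARGET.[NAME], TARGET.[SALARY])
THEN UPDATE SET
     TARGET.[NAME]   = SOURCE.[NAME],
     TARGET.[SALARY] = SOURCE.[SALARY]
WHEN NOT MATCHED BY TARGET
THEN INSERT ([ID], [NAME], [SALARY])
     VALUES (SOURCE.[ID], SOURCE.[NAME], SOURCE.[SALARY]);
```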

Lastly, I suggest using MONEY instead of FLOAT for SALARY. FLOAT is an approximate type, so equality comparisons on it are unreliable.
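
A quick way to see the difference, as a standalone illustration that does not touch the tables above:

```sql
-- FLOAT is binary floating point, so 0.1 + 0.2 does not compare equal to 0.3;
-- MONEY is a fixed-point type and compares exactly.
SELECT
    CASE WHEN CAST(0.1 AS FLOAT) + CAST(0.2 AS FLOAT) = CAST(0.3 AS FLOAT)
         THEN 'equal' ELSE 'not equal' END AS float_compare,
    CASE WHEN CAST(0.1 AS MONEY) + CAST(0.2 AS MONEY) = CAST(0.3 AS MONEY)
         THEN 'equal' ELSE 'not equal' END AS money_compare;
-- float_compare = 'not equal', money_compare = 'equal'
```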