I have two tables. TableA is populated by a CSV import and typically contains between 10k and 15k rows. TableB has the same structure and has now grown to about 95k rows. To find rows in TableA that are not in TableB, I compare a concatenation of four fields in TableA against the same concatenation in TableB.
The code below worked while TableB was smaller, but it now runs so long that it never finishes and has to be cancelled.
I strongly believe that the use of concatenated fields as a comparison is causing execution times to grow beyond usability.
Is there a better approach to the problem?
-- Empty the staging table, then reload it from the CSV export
DELETE FROM billing..whse_Temp

BULK INSERT billing..whse_Temp
FROM '/mnt/ABC/ABC.csv'
WITH
(
    FORMAT = 'csv',
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\r\n'
)

-- Copy over only the staging rows whose four-column concatenation
-- does not already exist in the target table
INSERT INTO billing..whse
SELECT * FROM billing..whse_Temp S
WHERE CONCAT(S.RunTimeStamp, S.CS_Datacenter, S.Customer, S.ServerName) NOT IN
(
    SELECT CONCAT(RunTimeStamp, CS_Datacenter, Customer, ServerName)
    FROM billing..whse
)
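One thing I noticed while testing: beyond the speed issue, the concatenation can also treat two different rows as equal when characters shift across field boundaries. A contrived illustration (the values are made up, not real data):

-- Two different rows whose concatenations collide:
-- both sides produce '2022-01-01DC1ACustSrv1'
SELECT CASE
    WHEN CONCAT('2022-01-01', 'DC1A', 'Cust', 'Srv1')
       = CONCAT('2022-01-01', 'DC1', 'ACust', 'Srv1')
    THEN 'false match' ELSE 'distinct' END AS result
-- returns 'false match'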
CodePudding user response:
Simply use NOT EXISTS:
INSERT INTO billing..whse
SELECT * FROM billing..whse_temp S
WHERE NOT EXISTS
(
SELECT NULL
FROM billing..whse w
WHERE w.runtimestamp = s.runtimestamp
AND w.cs_datacenter = s.cs_datacenter
AND w.customer = s.customer
AND w.servername = s.servername
);
The appropriate index for this:
CREATE INDEX idx ON billing..whse (runtimestamp, cs_datacenter, customer, servername);
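As a side note, NOT EXISTS also behaves sanely when the compared columns can contain NULLs, whereas NOT IN returns nothing as soon as the subquery yields a single NULL. I don't know whether your columns are nullable, so this may not apply, but here is a quick demonstration with throwaway values:

-- NOT IN: a NULL in the subquery makes the predicate unknown for every row,
-- so nothing comes back
SELECT T.val
FROM (VALUES (1), (2)) AS T(val)
WHERE T.val NOT IN (SELECT U.v FROM (VALUES (2), (NULL)) AS U(v));
-- returns 0 rows

-- NOT EXISTS: NULLs simply fail to match, so the unmatched row is returned
SELECT T.val
FROM (VALUES (1), (2)) AS T(val)
WHERE NOT EXISTS (SELECT 1 FROM (VALUES (2), (NULL)) AS U(v) WHERE U.v = T.val);
-- returns 1 row: val = 1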
CodePudding user response:
I'm sure there are ways to do it with a MERGE command, but I've never really used those. There are also ways with EXISTS across multiple columns, but I personally find it clearer to write out the full join condition and then test for where the join failed (i.e., no row on the right side):
INSERT INTO billing..whse
SELECT S.*
FROM billing..whse_Temp S
LEFT OUTER JOIN billing..whse W
ON S.RunTimeStamp = W.RunTimeStamp
AND S.CS_Datacenter = W.CS_Datacenter
AND S.Customer = W.Customer
AND S.ServerName = W.ServerName
WHERE W.RunTimeStamp IS NULL; -- the join found no matching row in whse
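For completeness, since MERGE came up: a minimal sketch of the same insert-if-missing logic, assuming the four columns shown are the whole table (the real tables presumably have more columns, so the INSERT list would need to be extended accordingly):

MERGE billing..whse AS W
USING billing..whse_Temp AS S
   ON W.RunTimeStamp  = S.RunTimeStamp
  AND W.CS_Datacenter = S.CS_Datacenter
  AND W.Customer      = S.Customer
  AND W.ServerName    = S.ServerName
-- insert staging rows that have no counterpart in the target
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RunTimeStamp, CS_Datacenter, Customer, ServerName)
    VALUES (S.RunTimeStamp, S.CS_Datacenter, S.Customer, S.ServerName);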