Home > Software design >  How to optimize an UPDATE and JOIN query on practically identical tables?
How to optimize an UPDATE and JOIN query on practically identical tables?

Time:11-27

I am trying to update one table based on another in the most efficient way.

Here is the table DDL of what I am trying to update

Table1

CREATE TABLE `customersPrimary` (
  `id` int NOT NULL AUTO_INCREMENT,
  `groupID` int NOT NULL,
  `IDInGroup` int NOT NULL,
  `name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Table2

CREATE TABLE `customersSecondary` (
  `groupID` int NOT NULL,
  `IDInGroup` int NOT NULL,
  `name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Both the tables are practically identical but customersSecondary table is a staging table for the other by design. The big difference is primary keys. Table 1 has an auto incrementing primary key, table 2 has a composite primary key.

In both tables the combination of groupID and IDInGroup are unique.

Here is the query I want to optimize

UPDATE customersPrimary 
INNER JOIN customersSecondary ON 
  (customersPrimary.groupID = customersSecondary.groupID
  AND customersPrimary.IDInGroup =  customersSecondary.IDInGroup)
SET
 customersPrimary.name = customersSecondary.name, 
 customersPrimary.address = customersSecondary.address

This query works but scans EVERY row in customersSecondary.

Adding

WHERE customersPrimary.groupID = (groupID)

Cuts it down significantly to the number of rows with the GroupID in customersSecondary. But this is still often far larger than the number of rows being updated since the groupID can be large. I think the WHERE needs improvement.

I can control table structure and add indexes. I will have to keep both tables.

Any suggestions would be helpful.

CodePudding user response:

Your existing query requires a full table scan because you are saying update everything on the left based on the value on the right. Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it has.

Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?

You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.

ALTER TABLE `customersPrimary`
    ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
    ADD INDEX `idx_customer_primary_updated` (`updatedAt`);

ALTER TABLE `customersSecondary`
    ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);

And then you can add updatedAt to your join criteria and the WHERE clause -

UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
    ON cp.groupID = cs.groupID
    AND cp.IDInGroup =  cs.IDInGroup
    AND cp.updatedAt < cs.updatedAt
SET
    cp.name = cs.name, 
    cp.address = cs.address,
    cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > :last_query_run_time;

For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary -

UPDATE customersPrimary cp
INNER JOIN (SELECT MAX(updatedAt) maxUpdatedAt FROM customersPrimary) t
INNER JOIN customersSecondary cs
    ON cp.groupID = cs.groupID
    AND cp.IDInGroup =  cs.IDInGroup
    AND cp.updatedAt < cs.updatedAt
SET
    cp.name = cs.name, 
    cp.address = cs.address,
    cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > t.maxUpdatedAt;
  • Related