MySQL update of large table based on another large table too slow


I have one table that looks like this:

+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| name        | varchar(255) | NO   | PRI | NULL    |       |
| timestamp1  | int          | NO   |     | NULL    |       |
| timestamp2  | int          | NO   |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+

This table has around 250 million rows. Once a day I get a CSV containing around 225 million rows with just a single name column. About 99% of the names in the daily CSV are already in the database. For the names that are already there, I want to update their timestamp1 column to UNIX_TIMESTAMP(NOW()); the names that are in the CSV but not in the original table should be inserted. Right now this is how I am doing it:

DROP TEMPORARY TABLE IF EXISTS tmp_import;
CREATE TEMPORARY TABLE tmp_import (name VARCHAR(255), PRIMARY KEY (name));
LOAD DATA LOCAL INFILE 'path.csv' INTO TABLE tmp_import LINES TERMINATED BY '\n';
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE og.name IN (SELECT tmp.name FROM tmp_import tmp);
DELETE FROM tmp_import WHERE name IN (SELECT og.name FROM og_table og);
INSERT INTO og_table SELECT name, UNIX_TIMESTAMP(NOW()) AS timestamp1, UNIX_TIMESTAMP(NOW()) AS timestamp2 FROM tmp_import;

As someone might guess, the UPDATE line is taking a long time: over 6 hours, or it throws an error. Reading the data in takes upwards of 40 minutes. I know this is mostly because an index is being built for name; when I don't set it as a primary key, loading only takes 9 minutes, but I thought having the index would speed up the update. I have tried the update several different ways: what I have above, and the following:

UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE EXISTS (SELECT tmp.name FROM tmp_import tmp where tmp.name = og.name);
UPDATE og_table og inner join tmp_import tmp on og.name=tmp.name SET og.timestamp1 = UNIX_TIMESTAMP(NOW());

Neither of those attempts worked. Each normally runs for several hours and then ends with:

ERROR 1206 (HY000): The total number of locks exceeds the lock table size

I am using InnoDB for these tables, but I don't need foreign keys or the other benefits of that engine, so I would be open to trying a different storage engine.

I have been looking through a lot of posts and have yet to find something to help in my situation. If I missed a post I apologize.

CodePudding user response:

If the name values are rather long, you might greatly improve performance by using a hash function such as MD5 or SHA-1 and storing and indexing only the hash. You probably don't even need all 128 or 160 bits; an 80-bit portion should be good enough, with a very low chance of a collision. See this.
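
For a concrete idea of what that could look like (a sketch only; the hash column name and the migration steps are assumptions, not part of your current schema): keep the first 10 bytes (80 bits) of an SHA-1 hash in a BINARY(10) column, index that, and join on it instead of the full VARCHAR(255).

ALTER TABLE og_table ADD COLUMN name_hash BINARY(10);

-- One-time backfill: 20 hex characters of SHA-1 = 80 bits.
UPDATE og_table SET name_hash = UNHEX(LEFT(SHA1(name), 20));

-- Plain (non-unique) index, to be safe against the rare collision.
ALTER TABLE og_table ADD KEY idx_name_hash (name_hash);

-- The daily update then joins on a short, fixed-width key.
UPDATE og_table og
JOIN tmp_import tmp ON og.name_hash = UNHEX(LEFT(SHA1(tmp.name), 20))
SET og.timestamp1 = UNIX_TIMESTAMP(NOW());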

Another thing you might want to check is if you have enough RAM. How big is your table and how much RAM do you have? Also, it's not just about how much RAM you have on the machine, but how much of it is available to MySQL/InnoDB's buffer cache.
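
If you are not sure, you can check both from MySQL itself with standard queries against the system variables and information_schema (replace 'your_schema' with your actual schema name):

-- How much memory is InnoDB allowed to use for its buffer pool?
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;

-- How big are the table's data and indexes on disk?
SELECT (data_length + index_length) / 1024 / 1024 / 1024 AS table_gb
FROM information_schema.TABLES
WHERE table_schema = 'your_schema'   -- placeholder
  AND table_name = 'og_table';

As a side note, the InnoDB lock table lives inside the buffer pool, so a small buffer pool is directly related to the ERROR 1206 you are seeing.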

What disk are you using? If you are using a spinning disk (HDD), that might be a huge bottleneck if InnoDB needs to constantly make scattered reads.

There are many other things that might help, but I would need more details. For example, if the names in the CSV are not sorted, and your buffer caches are about 10-20% of the table size, you might have a huge performance boost by splitting the work in batches, so that names in each batch are close enough (for example, first process all names that start with 'A', then those starting with 'B', etc.). Why would that help? In a big index (in InnoDB tables are also implemented as indexes) that doesn't fit into the buffer cache, if you make millions of reads all around the index, the DB will need to constantly read from the disk. But if you work on a smaller area, the data blocks (pages) will only be read once and then they will stay in the RAM for subsequent reads until you've finished with that area. That alone might easily improve performance by 1 or 2 orders of magnitude, depending on your case.
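
To make that concrete, here is a minimal sketch of the batching idea (the prefix boundaries are illustrative and would need to match your actual distribution of names):

UPDATE og_table og
JOIN tmp_import tmp ON og.name = tmp.name
SET og.timestamp1 = UNIX_TIMESTAMP(NOW())
WHERE og.name >= 'A' AND og.name < 'B';

-- ...then the same statement for 'B' to 'C', 'C' to 'D', and so on,
-- committing after each batch.

Smaller batches also mean fewer row locks per transaction, which by itself helps avoid the "total number of locks exceeds the lock table size" error.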

CodePudding user response:

A big update (as Barmar points out) takes a long time. Let's avoid it by building a new table, then swapping it into place.

First, let me get clarification and provide a minimal example.

You won't be deleting any rows, correct? Just adding or updating rows?

You have (in og_table):

A   88     123
B   99     234

The daily load (tmp_import) says

B
C

You want

A   88     123
B   NOW()  234
C   NOW()  NULL

Is that correct? Now for the code:

  1. Load nightly data and build the merge table:

     LOAD DATA ... (name)  -- into TEMPORARY tmp_import
     CREATE TABLE merge LIKE og_table;  -- not TEMPORARY
    
  2. Populate a new table with the data merged together

     INSERT INTO merge
         -- B and C (from the example):
         -- B and C (from the example):
         ( SELECT ti.name, UNIX_TIMESTAMP(NOW()), og.timestamp2
             FROM tmp_import AS ti
             LEFT JOIN og_table AS og USING(name)
         ) UNION ALL
         -- A:
         ( SELECT og.name, og.timestamp1, og.timestamp2
             FROM og_table AS og
             LEFT JOIN tmp_import AS ti USING(name)
             WHERE ti.name IS NULL   -- (that is, missing from csv)
         );
    
  3. Swap it into place

     RENAME TABLE og_table TO x,
                  merge TO og_table;
     DROP TABLE x;
    

Bonus: og_table is "down" only very briefly (during the RENAME).

A possible speed-up: Sort the CSV file by name before loading it. (If that takes an extra step, the cost of that step may be worse than the cost of not having the data sorted. There is not enough information to predict.)
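
If you would rather keep the sorting inside MySQL than add a separate file-processing step, one option (a sketch only; staging is a placeholder name) is to load into an unindexed staging table first and then copy the names into the keyed table in sorted order, so the PRIMARY KEY index is built from sequential inserts:

CREATE TEMPORARY TABLE staging (name VARCHAR(255));   -- no index, so the load is fast
LOAD DATA LOCAL INFILE 'path.csv' INTO TABLE staging LINES TERMINATED BY '\n';

CREATE TEMPORARY TABLE tmp_import (name VARCHAR(255), PRIMARY KEY (name));
INSERT INTO tmp_import SELECT name FROM staging ORDER BY name;

Whether that is a net win is the same trade-off as above: the extra copy has a cost of its own.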
