How to optimise an SQL query that runs on millions of records


I have an SQL table with the following columns: id, name, employee_number, age, class

I need to update the employee_number which is of the form: 1234 56XX XXXX 7890

to the new value: XXXX XXXX XXXX 7890

I have created the following query to do the task and written a script that executes it in a loop:

UPDATE tablename 
    SET employee_number = CONCAT('XXXX XXXX XXXX ', RIGHT(employee_number, 4)) 
WHERE employee_number REGEXP '^[0-9]{4}' 
ORDER BY id ASC 
LIMIT ?

The '?' is replaced by a user-supplied batch size, which I have set to 50,000 records. I have 200 million records. The script has been running for 3 days and has updated only 70 million records so far. I put a sleep of 5 seconds after each update to keep CPU usage at a nominal level.
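To make the setup concrete, the driver loop is roughly the sketch below (shown in Python with mysql-connector-python purely for illustration; the connection details are placeholders and my actual script is not exactly this):

import time
import mysql.connector

BATCH = 50_000  # the user argument substituted for '?'

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()

while True:
    cur.execute(
        "UPDATE tablename "
        "SET employee_number = CONCAT('XXXX XXXX XXXX ', RIGHT(employee_number, 4)) "
        "WHERE employee_number REGEXP '^[0-9]{4}' "
        "ORDER BY id ASC LIMIT %s",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:   # nothing left to update, stop
        break
    time.sleep(5)           # keep CPU usage at a nominal level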

My questions are:

  1. Is the query correct for the required task?
  2. Why is it taking this much time? Is it due to the REGEXP?
  3. Is there a better way? Is it possible to make small changes to the query so that it runs quickly on all the records, where "quickly" means within a day or two?

Please suggest a better and optimised way.

CodePudding user response:

It's true that REGEXP cannot use an index. Each batch has to scan the table from the start, skipping over the rows that were already updated, until it finds enough matching rows, so every batch scans more rows than the one before. This is probably why it is taking so long. You have done 1,400 of these big scans in 3 days. :-(

I would do it this way:

UPDATE tablename 
  SET employee_number = CONCAT('XXXX XXXX XXXX ', RIGHT(employee_number, 4)) 
WHERE id BETWEEN ? AND ?
  AND employee_number REGEXP '^[0-9]{4}'

Selecting a range of rows by primary key is pretty efficient. It doesn't scan the whole table. It only examines the rows that match the range.

For the first batch, the first ? is the minimum id value, which may be 1, or it might be higher in your table. The second ? is the first value plus the batch size. I would choose a smaller batch size, even a small batch like 1000.

Once the batch finishes, add 1000 (or whatever you choose as the batch size) to the two parameter numbers, and run it again. Each batch should run faster, because it doesn't have to do a table-scan.

It's okay if there are gaps in the id values. For example, suppose some rows between 11250 and 11350 are missing. It just means the batch of 1000 only scans 900 rows, and it will simply finish more quickly. Then you move on to the next batch.
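To make that concrete, a driver loop for the batched version could look like the sketch below (Python with mysql-connector-python is only an assumption, since you haven't shown your script; the connection details, table name, and batch size of 1000 are placeholders to adjust):

import mysql.connector

BATCH = 1_000   # a deliberately small batch size

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()

# Find the id range once, then walk it in fixed-size steps.
cur.execute("SELECT MIN(id), MAX(id) FROM tablename")
min_id, max_id = cur.fetchone()
if min_id is None:
    raise SystemExit("table is empty")

start = min_id
while start <= max_id:
    cur.execute(
        "UPDATE tablename "
        "SET employee_number = CONCAT('XXXX XXXX XXXX ', RIGHT(employee_number, 4)) "
        "WHERE id BETWEEN %s AND %s "
        "AND employee_number REGEXP '^[0-9]{4}'",
        (start, start + BATCH - 1),
    )
    conn.commit()
    start += BATCH   # gaps in id just mean a batch touches fewer rows

Each UPDATE only looks at its own slice of the primary key, so the cost per batch stays roughly constant instead of growing as the updated portion of the table grows.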


I assume that id is the primary key, because this is a popular design. But when you ask a question about query optimization, it is helpful to include the output of SHOW CREATE TABLE <tablename> in your question, so we don't have to make assumptions about which column is the primary key, or about the data types, indexes, or constraints in the table. It helps us to help you! :-)
