I met performance issue with batchUpdate for multiple rows on mysql 8 by using java.
There are more than 400,000 rows in the table , I was trying to update a column by it's id.
The sql is like: update table_name set status='status' where id = 1.
I used batchUpdate to update 10,000 rows per time.
There is an index for the column "id", and the parameter was set to jdbcUrl too:
rewriteBatchedStatements=true
The java code is :
String sql = "update table_name set status='status' where id=?";
try(Connection conn = ds.getConnection();PreparedStatement ps = conn.prepareStatement(sql)){
for(int i=0;i<idList.size();i ){
ps.setInt(1,idList.get(i));
ps.addBatch();
if((i000==0&&i>0)||i==idList.size()-1){
System.out.println("now we execute from " (i-10000) " to " i);
long s = System.currentTimeMillis();
ps.executeBatch();
System.out.println("execute from " (i-10000) " to " i " took " (System.currentTimeMillis()-s));
}
}
}catch(Exception ex){
ex.printStackTrace();
}
By executing this code, I found it was very slow. It took about 60 to 90 seconds to update 10,000 records.
But when I called this code again, I found it only took 1 second to finish the job ,only if these rows were updated once.
I just wonder why the performace is so different. Does it cause by some cache mechanism?
When the performance was poor, I checked the innodb status, and found the update operation was waiting for handler commit. So I understand why it is slow. But why it was much faster when I run it again?
The phenomenon only occurs when I am not using transaction. Once the transaction is added to the code, it always takes about 1 second.
CodePudding user response:
UPDATE
must save the old copy of each row. It tries to save it in a log file, but if that fills up, it does it in a slower way, using the actual tables. I feel much safer doing batches of only 100 to 1000.
Also, any secondary indexes are queued up for later storing to disk. (Don't worry, even a system crash will not lose them.)
10,000 is well into "diminishing returns", so there is no advantage in going that high. And as mentioned above, a potential disadvantage.
When an UPDATE
does not change any values, it can run faster.
To change an entire column of a table smells like a design flaw. Perhaps a timestamp would let you discover the "status" during any subsequent SELECT
? This might completely eliminate the big UPDATE
. Or maybe the status
can be stored in another table?