UPDATE with LEFT JOIN and IS NULL results in extremely low performance-CodePudding

Today, we had major performance issue caused by this exact query (with different names of course):

UPDATE foo
LEFT JOIN bar 
ON bar.foo_id = foo.id
SET foo.flag = 1
WHERE bar.foo_id IS NULL;

The software is third-party and we didn't write the code or database schema.

foo has 42k rows, bar has 130k. A particular foo.id occurred very often in bar.foo_id. (bar is a relationship table implementing an m:n relationship between foo and qux. Almost each row in qux will have a relationship with a specific foo row and then to several more that are unique to it. There may be more duplicates, likely not important.) An index on foo.id exists. bar.foo_id is part of an index also containing bar.qux_id and unrelated bar.text.

The query ran for 55min at 100% CPU of a db.t3.medium instance using Aurora MySQL 5.7. The slowlog stated Rows_examined: 5719954827.

Can someone explain what exactly the reason for this number and the resulting low performance is? 42k * 130k is close to it so that probably has something to do with it.

In our case, we could simply GROUP BY foo_id and then join with the result instead of bar which solved the problem. However, that won't be possible in all circumstances - and I don't think creating an index on bar.foo_id should have an effect as NULL values can only occur due to the JOIN statement.

The EXPLAIN output is as follows:

id|select_type|table            |partitions|type |possible_keys|key                  |key_len|ref|rows  |filtered|Extra                   |
-- ----------- ----------------- ---------- ----- ------------- --------------------- ------- --- ------ -------- ------------------------ 
 1|UPDATE     |foo              |          |ALL  |             |                     |       |   | 42921|   100.0|                        |
 1|SIMPLE     |bar              |          |index|             |uidx_qux_foo         |780    |   |129465|    10.0|Using where; Using index|

To my knowledge, EXPLAIN PLAN is not available in MySQL.

CodePudding user response：

You're only writing fields in one table, the aptly-named foo, so write your query to only seek/scan once into that one table:

update foo set flag = 1 where ...

Instead, what you wrote does a product of your two (three?) tables and then sets the value for every single one of the resulting rows. That's needlessly wasteful, since, again, you only want to set data in your foo table at most once per row.

And once you fix your overall logic, what you're left with is the most typical database access pattern: providing correct indexes for your select statement (ie, the part after where above). A simple explain plan should show you exactly what the problem is, if any. Your post is of no help at all here.

CodePudding user response：

Rows_examined: 5719954827 is a strong indicator.

Change the query to this "semi-join". It will not need to repeatedly set foo.flag, but will do int only once per row in foo.

UPDATE foo
    SET foo.flag = 1
    WHERE NOT EXISTS ( SELECT 1 FROM bar
                         WHERE bar.foo_id = foo.id );

That will probably give you something like "84K rows examined" (84K = 2*42K).

And bar does need an INDEX starting with foo_id.

Why do you need this flag? Why do you need to change lots of rows at once? Can't you design the app some other way?

For any further discuss, please provide SHOW CREATE TABLE.