I have an update that takes a very long time to finish: 10 million rows need to be updated, and the execution took 6 hours.
This is the query:
update A
set a_top = 'N'
where (a_toto, a_num, a_titi) in
(select a_toto, a_num, a_titi
from A
where a_titi <> 'test' and a_top is null limit 10000000);
Two indexes have been created:
CREATE UNIQUE INDEX pk_A ON A USING btree (a_toto, a_num, a_titi)
CREATE INDEX id1_A ON A USING btree (a_num)
These are the things I have already checked:
- No locks
- No triggers on A
The execution plan shows that the indexes are not used. Would it change anything if I dropped the indexes, updated the rows, and then recreated the indexes afterwards?
Is there a way to improve the query itself?
Here is the execution plan:
Update on A  (cost=3561856.86..10792071.61 rows=80305304 width=200)
  ->  Hash Join  (cost=3561856.86..10792071.61 rows=80305304 width=200)
        Hash Cond: (((A.a_toto)::text = (("ANY_subquery".a_toto)::text)) AND ((A.a_num)::text = (("ANY_subquery".a_num)::text)) AND ((A.a_titi)::text = (("ANY_subquery".a_titi)::text)))
        ->  Seq Scan on A  (cost=0.00..2509069.04 rows=80305304 width=126)
        ->  Hash  (cost=3490830.00..3490830.00 rows=2082792 width=108)
              ->  Unique  (cost=3390830.00..3490830.00 rows=2082792 width=108)
                    ->  Sort  (cost=3390830.00..3415830.00 rows=10000000 width=108)
                          Sort Key: (("ANY_subquery".a_toto)::text), (("ANY_subquery".a_num)::text), (("ANY_subquery".a_titi)::text)
                          ->  Subquery Scan on "ANY_subquery"  (cost=0.00..484987.17 rows=10000000 width=108)
                                ->  Limit  (cost=0.00..384987.17 rows=10000000 width=42)
                                      ->  Seq Scan on A A_1  (cost=0.00..2709832.30 rows=70387600 width=42)
                                            Filter: ((a_top IS NULL) AND ((a_titi)::text <> 'test'::text))
(12 rows)
Thanks for your help.
CodePudding user response:
The index I would have suggested is:
CREATE UNIQUE INDEX pk_A ON A USING btree (a_titi, a_top, a_toto, a_num);
This index covers the WHERE clause of the subquery, allowing Postgres to throw away records which don't meet the criteria. The index also includes the three columns which are needed in the SELECT clause.
One reason your current first index is not being used is that it doesn't cover the WHERE clause. That index, if used, might require a full index scan, during which Postgres would have to manually filter off non-matching records.
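If you try that index, a quick way to see whether Postgres actually picks it up, before running the full update, is to look at the plan for the subquery on its own. If the planner chooses it, the Seq Scan in the subquery should be replaced by an index (or index-only) scan:
EXPLAIN
select a_toto, a_num, a_titi
from A
where a_titi <> 'test' and a_top is null limit 10000000;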
CodePudding user response:
PostgreSQL does a pretty poor job of optimizing bulk updates, because it optimizes them (almost) just like a select and throws an update on top. It doesn't consider how the order of the rows returned by the select-like portion will affect the IO pattern of the update itself. This can lead to devastatingly poor performance on high-latency devices, like spinning disks, and is often bad even on SSDs.
My theory is that you could get greatly improved performance by injecting a Sort-by-ctid node just below the Update node. But it looks really hard to do this, even in a gross and hackish way, just to get a proof of concept.
But if the Hash node can fit the entire hash table in work_mem, rather than spilling to disk, then the Hash Join should return tuples in physical order so they can be updated efficiently. This would require a work_mem very much larger than the default 4MB, though. (It is hard to say how much larger without trial and error, but even if it spills to disk in 4 batches, that should be much better than hundreds.)
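A minimal sketch of how you might test that, raising work_mem for this session only (the 1GB figure is just a placeholder; the right value needs trial and error, as noted above):
SET work_mem = '1GB';  -- placeholder value, adjust to what the server can spare

update A
set a_top = 'N'
where (a_toto, a_num, a_titi) in
(select a_toto, a_num, a_titi
 from A
 where a_titi <> 'test' and a_top is null limit 10000000);

RESET work_mem;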
You can probably get it to use an index plan by setting both enable_hashjoin and enable_mergejoin to off. Whether this will actually be faster is another question: it might have the same random-IO problem as the current method does, unless the table is clustered or something like that.
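To try that, something like the following would do it for the current session only (these are standard planner settings; check the plan with EXPLAIN before letting the real update run):
SET enable_hashjoin = off;
SET enable_mergejoin = off;

EXPLAIN
update A
set a_top = 'N'
where (a_toto, a_num, a_titi) in
(select a_toto, a_num, a_titi
 from A
 where a_titi <> 'test' and a_top is null limit 10000000);

RESET enable_hashjoin;
RESET enable_mergejoin;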
You should really go back to your client and ask what they are trying to accomplish here. If they just updated the table in one shot without the self join, they wouldn't have this problem. If they are using the LIMIT to try to get the UPDATE to run faster, then it is probably backfiring spectacularly. If they are doing it for some other reason, well, what is it?
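For reference, the one-shot version without the self join (assuming the LIMIT is not actually required) would just be:
update A
set a_top = 'N'
where a_titi <> 'test' and a_top is null;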