Optimise a simple update query for large dataset


I have some data migration that has to occur between a parent and child table. For the sake of simplicity, the schemas are as follows:

 -------    -----------
| event |  | parameter |
 -------    -----------
| id    |  | id        |
| order |  | eventId   |
 -------   | order     |
            -----------

Because of an oversight in the business logic, we need to update parameter.order to match the parent event.order. I have come up with the following SQL to do that:

UPDATE "parameter" p
SET    "order" = e."order"
FROM   "event" e
WHERE  p."eventId" = e.id;

The problem is that this query hadn't completed after more than 4 hours, and since I had to clock out, I cancelled it.

There are 11 million rows in parameter and 4 million rows in event. I've run EXPLAIN on the query and it tells me this:

Update on parameter (cost=706691.80..1706622.39 rows=11217313 width=155)
  ->  Hash Join  (cost=706691.80..1706622.39 rows=11217313 width=155)
        Hash Cond: (parameter."eventId" = e.id)
        ->  Seq Scan on parameter  (cost=0.00..435684.13 rows=11217313 width=145)
        ->  Hash  (cost=557324.91..557324.91 rows=7724791 width=26)
              ->  Seq Scan on event e  (cost=0.00..557324.91 rows=7724791 width=26)

According to this article, the "cost" reported by EXPLAIN is measured in an "arbitrary unit of computation".
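Those cost units cannot be converted to milliseconds directly, but the read side of the query can be timed without writing anything by running the equivalent SELECT under EXPLAIN (ANALYZE) (a sketch; the join mirrors the UPDATE above):

```sql
-- Actually executes the join (reads only, no writes) and reports
-- real elapsed milliseconds instead of arbitrary cost units.
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.id, e."order"
FROM   "parameter" p
JOIN   "event" e ON p."eventId" = e.id;
```

This only times the join, not the row rewrites, so it gives a lower bound on the UPDATE's duration.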

Ultimately, this update needs to be performed, but I would accept it happening in one of two ways:

  1. I am advised of a better way to do this query that executes in a timely manner (I'm open to all suggestions, including updating schemas, indexing, etc.)

  2. The query remains the same, but I can somehow get an accurate prediction of its execution time (even if it's hours long), so that I can at least manage the team's expectations. I understand that the exact time can't be known without actually running the query, but is there an easy way to "convert" these arbitrary units into milliseconds?
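One common approach to both points (a sketch, assuming parameter.id is a roughly sequential integer key) is to run the update in id-range batches: each transaction stays short, progress is visible, and the total run time can be extrapolated from the first few batches:

```sql
-- Update one slice of parameter at a time; repeat with the next
-- id range until all ~11M rows are done. Timing one batch lets
-- you extrapolate the total duration.
UPDATE "parameter" p
SET    "order" = e."order"
FROM   "event" e
WHERE  p."eventId" = e.id
AND    p.id >= 1 AND p.id < 100001;  -- batch 1; advance the range each run
```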

Edit for Jim Jones' comment:

I executed the following query:

SELECT psa.pid, locktype, mode, query, query_start, state
FROM pg_locks pl
LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;

I got 9 identical rows like the following:

pid   | locktype | mode            | query       | query_start         | state
------+----------+-----------------+-------------+---------------------+--------
23192 | relation | AccessShareLock | <see below> | 2021-10-26 14:10:01 | active

query column:

--update parameter
--set "order" = e."order"
--from "event" e
--where "eventId" = e.id
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid

Edit 2: I think I've been stupid here... The query reported by the lock check is just my own lock-checking query (the UPDATE appears only as a comment above it), which means there's actually nothing to report.
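For what it's worth, the monitoring session can be excluded from its own results with pg_backend_pid() (a small sketch):

```sql
-- Same lock check, but hide the monitoring query itself.
SELECT psa.pid, locktype, mode, query, query_start, state
FROM   pg_locks pl
LEFT   JOIN pg_stat_activity psa ON pl.pid = psa.pid
WHERE  psa.pid <> pg_backend_pid();
```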

CodePudding user response:

If some rows already have the target value, you can skip those empty updates, which would otherwise be performed at full cost. Like:

UPDATE parameter p
SET    "order" = e."order"
FROM   event e
WHERE  p."eventId" = e.id
AND    p."order" IS DISTINCT FROM e."order";  -- this

If both "order" columns are defined NOT NULL, simplify to:

...
AND    p."order" <> e."order";


If you have to update all or most rows - and can afford it! - writing a new table may be cheaper overall, as Mike already mentioned. But concurrency and dependent objects may stand in the way.

Aside: use legal, lower-case identifiers so you don't have to double-quote them. It makes your life with Postgres easier.

CodePudding user response:

The query will be slow because each UPDATE has to look up the matching event row by id. Even with an index, on a large table this is a per-row read and write, so it is slow.

I'm not sure how to get a good estimate; maybe update 1% of the table and multiply?
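That sample-and-extrapolate idea can be tried safely inside a transaction that is rolled back (a sketch using psql's \timing; the id cutoff assumes roughly sequential ids):

```sql
\timing on
BEGIN;
-- Update roughly 1% of parameter, note the elapsed time, then undo it.
UPDATE "parameter" p
SET    "order" = e."order"
FROM   "event" e
WHERE  p."eventId" = e.id
AND    p.id < 110000;  -- ~1% of 11M rows, assuming sequential ids
ROLLBACK;
```

Multiplying the measured time by 100 gives only a rough estimate, since cache warm-up and WAL behavior differ at full scale.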

I suggest creating a new table, then dropping the old one and renaming the new table.

CREATE TABLE parameter_new AS
SELECT
  parameter.id,
  parameter."eventId",
  e."order"
FROM parameter
JOIN event e ON e.id = parameter."eventId";

Later, once you verify things:

ALTER TABLE parameter RENAME TO parameter_old;
ALTER TABLE parameter_new RENAME TO parameter;

Later, once you're completely certain:

DROP TABLE parameter_old;
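Note that CREATE TABLE ... AS copies only the data, not indexes, constraints, or defaults, so those must be recreated on the new table before the swap (a sketch, assuming a primary key on id and an index on "eventId"):

```sql
-- Recreate whatever the original parameter table had; these two
-- are assumptions based on the schema shown above.
ALTER TABLE parameter_new ADD PRIMARY KEY (id);
CREATE INDEX ON parameter_new ("eventId");
```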