Home > Software engineering >  SELECT FOR UPDATE subquery not respecting LIMIT clause under load
SELECT FOR UPDATE subquery not respecting LIMIT clause under load

Time:10-06

I have the following query that uses the psql SKIP LOCKED feature to poll messages from a queue table sorted by the insertion time. My subquery also limits the result set by 10 since that is the maximum number of results that can be processed by the consumer.

DELETE FROM outbox_queue
WHERE id IN (
    SELECT id FROM outbox_queue
    WHERE delay_until IS NULL OR now() > delay_until
    ORDER BY enqueued_on, id
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING *

Under normal load, everything works as expected. However when I apply load tests to my system, I will occasionally (<5%) get results the range from dozens of records to even several hundred records being returned. This is confusing to me since my understanding is that the subquery should have a limit placed on the number of records returned.

I've checked my logs and verified that the records being pulled are indeed unique and are not duplicates. Once load is reduced, the behavior returns to what I would expect.

My application has three workers that poll messages from this table, and each worker has a single thread that runs this query and processes the result.

Postgres version:

PostgreSQL 10.18 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit

What am I missing?

CodePudding user response:

The query plan for this particular combination can result in a Nested Loop, which can exceed the LIMIT set in the subquery. See:

Rows are locked after they have been returned from the subquery, and running this in a nested loop in combination with SKIP LOCKED can sidestep the LIMIT. I have not fully verified the exact mechanism to this day. Also, this may have been fixed in recent versions. Didn't rerun tests, yet.

Materializing the selection in a CTE reliably prevents the problem in any case:

WITH sel AS (
   SELECT id
   FROM   outbox_queue
   WHERE  delay_until >= now() IS NOT TRUE  -- simpler & may be faster
-- WHERE  delay_until IS NULL OR now() > delay_until
   ORDER  BY enqueued_on, id
   LIMIT  10
   FOR    UPDATE SKIP LOCKED
   )
DELETE FROM outbox_queue
WHERE  id IN (TABLE sel)
RETURNING *

While being at it I replaced:

delay_until IS NULL OR now() > delay_until

with:

delay_until >= now() IS NOT TRUE

Equivalent. Simpler. May be faster.

Postgres 12 added the MATERIALIZED option. Explicitly materialize to be sure:

WITH sel AS MATERIALIZED (  -- !
   SELECT id FROM outbox_queue
   WHERE  delay_until >= now() IS NOT TRUE
   ORDER  BY enqueued_on, id
   LIMIT  10
   FOR    UPDATE SKIP LOCKED
   )
DELETE FROM outbox_queue
WHERE  id IN (TABLE sel)
RETURNING *

The manual:

By default, a side-effect-free WITH query is folded into the primary query if it is used exactly once in the primary query's FROM clause. This allows joint optimization of the two query levels in situations where that should be semantically invisible. However, such folding can be prevented by marking the WITH query as MATERIALIZED. That might be useful, for example, if the WITH query is being used as an optimization fence to prevent the planner from choosing a bad plan. PostgreSQL versions before v12 never did such folding, so queries written for older versions might rely on WITH to act as an optimization fence.

  • Related