I have the following query that uses the psql SKIP LOCKED
feature to poll messages from a queue table sorted by the insertion time. My subquery also limits the result set by 10 since that is the maximum number of results that can be processed by the consumer.
DELETE FROM outbox_queue
WHERE id IN (
SELECT id FROM outbox_queue
WHERE delay_until IS NULL OR now() > delay_until
ORDER BY enqueued_on, id
LIMIT 10
FOR UPDATE SKIP LOCKED
)
RETURNING *
Under normal load, everything works as expected. However when I apply load tests to my system, I will occasionally (<5%) get results the range from dozens of records to even several hundred records being returned. This is confusing to me since my understanding is that the subquery should have a limit placed on the number of records returned.
I've checked my logs and verified that the records being pulled are indeed unique and are not duplicates. Once load is reduced, the behavior returns to what I would expect.
My application has three workers that poll messages from this table, and each worker has a single thread that runs this query and processes the result.
Postgres version:
PostgreSQL 10.18 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit
What am I missing?
CodePudding user response:
The query plan for this particular combination can result in a Nested Loop
, which can exceed the LIMIT
set in the subquery. See:
Rows are locked after they have been returned from the subquery, and running this in a nested loop in combination with SKIP LOCKED
can sidestep the LIMIT
. I have not fully verified the exact mechanism to this day. Also, this may have been fixed in recent versions. Didn't rerun tests, yet.
Materializing the selection in a CTE reliably prevents the problem in any case:
WITH sel AS (
SELECT id
FROM outbox_queue
WHERE delay_until >= now() IS NOT TRUE -- simpler & may be faster
-- WHERE delay_until IS NULL OR now() > delay_until
ORDER BY enqueued_on, id
LIMIT 10
FOR UPDATE SKIP LOCKED
)
DELETE FROM outbox_queue
WHERE id IN (TABLE sel)
RETURNING *
While being at it I replaced:
delay_until IS NULL OR now() > delay_until
with:
delay_until >= now() IS NOT TRUE
Equivalent. Simpler. May be faster.
Postgres 12 added the MATERIALIZED
option. Explicitly materialize to be sure:
WITH sel AS MATERIALIZED ( -- !
SELECT id FROM outbox_queue
WHERE delay_until >= now() IS NOT TRUE
ORDER BY enqueued_on, id
LIMIT 10
FOR UPDATE SKIP LOCKED
)
DELETE FROM outbox_queue
WHERE id IN (TABLE sel)
RETURNING *
By default, a side-effect-free
WITH
query is folded into the primary query if it is used exactly once in the primary query'sFROM
clause. This allows joint optimization of the two query levels in situations where that should be semantically invisible. However, such folding can be prevented by marking theWITH
query asMATERIALIZED
. That might be useful, for example, if theWITH
query is being used as an optimization fence to prevent the planner from choosing a bad plan. PostgreSQL versions before v12 never did such folding, so queries written for older versions might rely onWITH
to act as an optimization fence.