I have the following query:
-- Really fast
-- Explain Analyze: https://explain.depesz.com/s/lsq8
with start_time as (
select '2022-06-02T17:45:43Z'::timestamp with time zone as time
)
-- 200x slower
-- Explain Analyze: https://explain.depesz.com/s/CavD
with start_time as (
select last_update as time
from processed.last_update
where table_name = 'token_supply'
)
select ident as ma_id
, sum(quantity) as quantity
, sum(quantity) filter (where quantity > 0) as quantity_minted
from public.ma_tx_mint
where exists (
select id
from public.tx
where tx.id = ma_tx_mint.tx_id
and exists (
select id
from public.block
cross join start_time
where block.id = tx.block_id
and block.time >= start_time.time
)
)
group by ident
I am trying to query records added to a table after a specified time. If the time is hard-coded like in the first start_time, the query runs in 0.2 seconds. With the second start_time, where I dynamically retrieve the time, the query runs for 40 seconds. How can I get Postgres to treat these two cases identically and dynamically query the ma_tx_mint table based on another table's row?
Version: PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Tables:
create table public.ma_tx_mint (
id bigint
, quantity numeric
, tx_id bigint
, ident bigint
, primary key(id)
);
create table public.tx (
id bigint
, block_id bigint
, primary key(id)
);
create table public.block (
id bigint
, time timestamp with time zone
, primary key(id)
);
create table processed.last_update (
table_name varchar
, last_update timestamp with time zone
, primary key(table_name)
);
Explain Analyze:
fast: https://explain.depesz.com/s/lsq8
slow: https://explain.depesz.com/s/CavD
Answer:
The problem
Postgres has column statistics, including histogram bounds. Your constant timestamp (fast variant) seems to be close to the latest few rows, so Postgres knows to expect very few qualifying rows from table block. And that estimate turns out OK:
Index Scan using idx_block_time on block (cost=0.43..14.29 rows=163 width=8) (actual time=0.825..1.146 rows=891 loops=1)
With the subselect fetching a yet-unknown timestamp (slow variant), Postgres does not know what to expect and plans for a timestamp somewhere in the middle. Unfortunately, your table block seems to have around 7.5 million rows, so with an unknown filter Postgres expects around 1/3 of all rows to qualify, i.e. ~2.5 million:
Index Scan using idx_block_time on block (cost=0.43..127,268.65 rows=2,491,327 width=16) (actual time=1.261..1.723 rows=653 loops=3)
So Postgres plans for millions of qualifying rows and uses sequential scans, which turn out to be a bad choice for the few rows actually qualifying.
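You can check what the planner is working from yourself. A quick look at the column statistics for block.time (this only inspects what the last ANALYZE gathered; schema and table names below match the question):
SELECT attname, n_distinct, histogram_bounds
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'block'
AND    attname    = 'time';
If the histogram bounds end close to your constant timestamp, the planner can see that only a thin slice of the table qualifies; for a value it does not know at plan time it falls back to a default selectivity estimate, which is where the ~1/3 comes from.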
Possible solution
If you know a lower bound for the timestamp, you can add that as a (logically redundant) additional predicate to guide Postgres toward a more suitable plan:
SELECT ident AS ma_id
, sum(quantity) AS quantity
, sum(quantity) FILTER (WHERE quantity > 0) AS quantity_minted
FROM public.ma_tx_mint m
WHERE EXISTS (
SELECT FROM public.tx
WHERE tx.id = m.tx_id
AND EXISTS (
SELECT FROM public.block b
WHERE b.id = tx.block_id
AND b.time >= (SELECT last_update FROM processed.last_update WHERE table_name = 'token_supply')
AND b.time >= '2022-06-01 00:00+0' -- !!! some known minimum bound
)
)
GROUP BY 1;
Also, since table_name is the PK of processed.last_update, we know that the subquery only returns a single row, and we can use a simple scalar subquery. Should be a bit faster already.
But the main point is the added minimum bound. If that is selective enough, Postgres will know to switch to index scans like in your fast plan.
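If there is no safe static lower bound, another way to get the fast plan is to fetch the timestamp first and hand it to the main query as a constant: either in two round trips from the client, or server-side with dynamic SQL so the value is spliced into the statement before planning. A minimal sketch, assuming a PL/pgSQL wrapper function (the function name is made up; tables and columns are from the question):
CREATE OR REPLACE FUNCTION minted_since_last_update()
  RETURNS TABLE (ma_id bigint, quantity numeric, quantity_minted numeric)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _start timestamptz;
BEGIN
   -- look up the timestamp first ...
   SELECT last_update INTO _start
   FROM   processed.last_update
   WHERE  table_name = 'token_supply';

   -- ... then splice it in as a literal, so the planner estimates
   -- with the actual value, just like in the fast variant
   RETURN QUERY EXECUTE format(
   $q$
   SELECT ident
        , sum(quantity)
        , sum(quantity) FILTER (WHERE quantity > 0)
   FROM   public.ma_tx_mint m
   WHERE  EXISTS (
      SELECT FROM public.tx
      WHERE  tx.id = m.tx_id
      AND    EXISTS (
         SELECT FROM public.block b
         WHERE  b.id = tx.block_id
         AND    b.time >= %L
         )
      )
   GROUP  BY 1
   $q$, _start);
END
$func$;
Call it with SELECT * FROM minted_since_last_update();. The dynamic statement is planned at execution time with the literal in place, at the cost of a fresh plan per call.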
Aside
Casting timestamp constants to timestamptz is typically a bad idea:
'2022-06-02T17:45:43Z'::timestamptz
This will assume the time zone of the current session, which may or may not be as intended. Rather be explicit:
'2022-06-02T17:45:43Z'::timestamp AT TIME ZONE 'UTC'
'2022-06-02T17:45:43+0'::timestamptz
.. or whatever time zone you actually want to work with.
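To see the session dependency, compare a literal without any offset under two (hypothetical) session settings; the same string yields two different points in time:
SET timezone = 'UTC';
SELECT '2022-06-02 17:45:43'::timestamptz;   -- 2022-06-02 17:45:43+00
SET timezone = 'America/New_York';
SELECT '2022-06-02 17:45:43'::timestamptz;   -- 2022-06-02 17:45:43-04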