I have the following query:
-- Really fast
-- Explain Analyze: https://explain.depesz.com/s/lsq8
with start_time as (
select '2022-06-02T17:45:43Z'::timestamp with time zone as time
)
-- 200x slower
-- Explain Analyze: https://explain.depesz.com/s/CavD
with start_time as (
select last_update as time
from processed.last_update
where table_name = 'token_supply'
)
select ident as ma_id
, sum(quantity) as quantity
, sum(quantity) filter (where quantity > 0) as quantity_minted
from public.ma_tx_mint
where exists (
select id
from public.tx
where tx.id = ma_tx_mint.tx_id
and exists (
select id
from public.block
cross join start_time
where block.id = tx.block_id
and block.time >= start_time.time
)
)
group by ident
I am trying to query records added to a table after a specified time. If the time is hard-coded like in the first start_time, the query runs in 0.2 seconds. With the second start_time, where I dynamically retrieve the time, the query runs for 40 seconds. How can I get Postgres to treat these two cases identically and dynamically query the ma_tx_mint table based on another table's row?
Version: PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Tables:
create table public.ma_tx_mint (
id bigint
, quantity numeric
, tx_id bigint
, ident bigint
, primary key(id)
);
create table public.tx (
id bigint
, block_id bigint
, primary key(id)
);
create table public.block (
id bigint
, time timestamp with time zone
, primary key(id)
);
create table processed.last_update (
table_name varchar
, last_update timestamp with time zone
, primary key(table_name)
);
Explain Analyze:
fast: https://explain.depesz.com/s/lsq8
slow: https://explain.depesz.com/s/CavD
Answer:
The problem
Postgres has column statistics, including histogram bounds. Your constant timestamp (fast variant) seems to be close to the latest few rows, so Postgres knows to expect very few qualifying rows from table block. And that estimate turns out OK:
Index Scan using idx_block_time on block (cost=0.43..14.29 rows=163 width=8) (actual time=0.825..1.146 rows=891 loops=1)
With the subselect fetching a yet-unknown timestamp (slow variant), Postgres does not know what to expect and plans for a timestamp somewhere in the middle. Unfortunately, your table block seems to have around 7.5 million rows, so with an unknown filter Postgres expects around 1/3 of all rows to qualify, i.e. ~2.5 million:
Index Scan using idx_block_time on block (cost=0.43..127,268.65 rows=2,491,327 width=16) (actual time=1.261..1.723 rows=653 loops=3)
So Postgres plans for millions of qualifying rows and uses sequential scans, which turn out to be a bad choice for the few rows actually qualifying.
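You can check what the planner is working from yourself. A quick look at the column statistics for block.time (this only inspects what the last ANALYZE gathered; schema and table names below match the question):
SELECT attname, n_distinct, histogram_bounds
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'block'
AND    attname    = 'time';
If the histogram bounds end close to your constant timestamp, the planner can see that only a thin slice of the table qualifies; for a value it does not know at plan time it falls back to a default selectivity estimate, which is where the ~1/3 comes from.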
Possible solution
If you know a lower bound for the timestamp, you can add that as a (logically redundant) additional predicate to guide Postgres toward a more suitable plan:
SELECT ident AS ma_id
, sum(quantity) AS quantity
, sum(quantity) FILTER (WHERE quantity > 0) AS quantity_minted
FROM public.ma_tx_mint m
WHERE EXISTS (
SELECT FROM public.tx
WHERE tx.id = m.tx_id
AND EXISTS (
SELECT FROM public.block b
WHERE b.id = tx.block_id
AND b.time >= (SELECT last_update FROM processed.last_update WHERE table_name = 'token_supply')
AND b.time >= '2022-06-01 00:00+0' -- !!! some known minimum bound
)
)
GROUP BY 1;
Also, since table_name is the PK of processed.last_update, we know that the subquery only returns a single row, and we can use a simple scalar subquery. Should be a bit faster already.
But the main point is the added minimum bound. If that is selective enough, Postgres will know to switch to index scans like in your fast plan.
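If there is no safe static lower bound, another way to get the fast plan is to fetch the timestamp first and hand it to the main query as a constant: either in two round trips from the client, or server-side with dynamic SQL so the value is spliced into the statement before planning. A minimal sketch, assuming a PL/pgSQL wrapper function (the function name is made up; tables and columns are from the question):
CREATE OR REPLACE FUNCTION minted_since_last_update()
  RETURNS TABLE (ma_id bigint, quantity numeric, quantity_minted numeric)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _start timestamptz;
BEGIN
   -- look up the timestamp first ...
   SELECT last_update INTO _start
   FROM   processed.last_update
   WHERE  table_name = 'token_supply';

   -- ... then splice it in as a literal, so the planner estimates
   -- with the actual value, just like in the fast variant
   RETURN QUERY EXECUTE format(
   $q$
   SELECT ident
        , sum(quantity)
        , sum(quantity) FILTER (WHERE quantity > 0)
   FROM   public.ma_tx_mint m
   WHERE  EXISTS (
      SELECT FROM public.tx
      WHERE  tx.id = m.tx_id
      AND    EXISTS (
         SELECT FROM public.block b
         WHERE  b.id = tx.block_id
         AND    b.time >= %L
         )
      )
   GROUP  BY 1
   $q$, _start);
END
$func$;
Call it with SELECT * FROM minted_since_last_update();. The dynamic statement is planned at execution time with the literal in place, at the cost of a fresh plan per call.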
Aside
Casting timestamp constants to timestamptz is typically a bad idea:
'2022-06-02T17:45:43Z'::timestamptz
This will assume the time zone of the current session, which may or may not be as intended. Rather be explicit:
'2022-06-02T17:45:43Z'::timestamp AT TIME ZONE 'UTC'
'2022-06-02T17:45:43+0'::timestamptz
.. or whatever time zone you actually want to work with.
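To see the session dependency, compare a literal without any offset under two (hypothetical) session settings; the same string yields two different points in time:
SET timezone = 'UTC';
SELECT '2022-06-02 17:45:43'::timestamptz;   -- 2022-06-02 17:45:43+00
SET timezone = 'America/New_York';
SELECT '2022-06-02 17:45:43'::timestamptz;   -- 2022-06-02 17:45:43-04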