Postgres slow query performance: NOT IN is faster than an equality filter

So I have this query on a pretty big table:

SELECT * FROM datTable WHERE type='bla' 
AND timestamp > (CURRENT_DATE - INTERVAL '1 day')

This query is too slow (about 5 seconds), even though there is an index on type.

So I tried:

SELECT * FROM datTable WHERE type NOT IN ('blu','bli','blo') 
AND timestamp > (CURRENT_DATE - INTERVAL '1 day')

This query is much better (about 1 second), but I don't want to hardcode the list of excluded types.

So I tried:

with res as (
    SELECT * FROM datTable WHERE type NOT IN ('blu','bli','blo') 
    AND timestamp > (CURRENT_DATE - INTERVAL '1 day')
)
select * from res where type='bla'

And I'm back to bad performance: 5 seconds, same as before.

Any idea how I could trick Postgres into the 1-second performance while positively specifying the type I want ('bla')?

EDIT: EXPLAIN ANALYZE for the last query (the real table is event_historic and the real type value is 'NEAR_TRANSFER'):

GroupAggregate  (cost=677400.59..677493.09 rows=3595 width=59) (actual time=4789.667..4803.183 rows=3527 loops=1)
  Group Key: event_historic.sender
  ->  Sort  (cost=677400.59..677412.48 rows=4756 width=23) (actual time=4789.646..4792.808 rows=68045 loops=1)
        Sort Key: event_historic.sender
        Sort Method: quicksort  Memory: 9469kB
        ->  Bitmap Heap Scan on event_historic  (cost=505379.21..677110.11 rows=4756 width=23) (actual time=4709.494..4769.437 rows=68045 loops=1)
              Recheck Cond: (("timestamp" > (CURRENT_DATE - '1 day'::interval)) AND ((type)::text = 'NEAR_TRANSFER'::text))
              Heap Blocks: exact=26404
              ->  BitmapAnd  (cost=505379.21..505379.21 rows=44676 width=0) (actual time=4706.080..4706.082 rows=0 loops=1)
                    ->  Bitmap Index Scan on event_historic_timestamp_idx  (cost=0.00..3393.89 rows=263109 width=0) (actual time=167.838..167.838 rows=584877 loops=1)
                          Index Cond: ("timestamp" > (CURRENT_DATE - '1 day'::interval))
                    ->  Bitmap Index Scan on event_historic_type_idx  (cost=0.00..501982.69 rows=45316549 width=0) (actual time=4453.071..4453.071 rows=44279973 loops=1)
                          Index Cond: ((type)::text = 'NEAR_TRANSFER'::text)
Planning Time: 0.385 ms
JIT:
  Functions: 10
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.505 ms, Inlining 18.102 ms, Optimization 87.745 ms, Emission 44.270 ms, Total 152.622 ms
Execution Time: 4809.099 ms

EDIT 2: After adding the index on (type, timestamp) the result is way faster:

HashAggregate  (cost=156685.88..156786.59 rows=8057 width=59) (actual time=95.201..96.511 rows=3786 loops=1)
  Group Key: sender
  Batches: 1  Memory Usage: 2449kB
  Buffers: shared hit=31041
  ->  Index Scan using typetimestamp on event_historic eh  (cost=0.57..156087.67 rows=47857 width=44) (actual time=12.244..55.921 rows=76220 loops=1)
        Index Cond: (((type)::text = 'NEAR_TRANSFER'::text) AND ("timestamp" > (CURRENT_DATE - '1 day'::interval)))
        Buffers: shared hit=31041
Planning:
  Buffers: shared hit=5
Planning Time: 0.567 ms
JIT:
  Functions: 10
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 2.543 ms, Inlining 0.000 ms, Optimization 1.221 ms, Emission 10.819 ms, Total 14.584 ms
Execution Time: 99.496 ms

CodePudding user response:

You need a two-column index on ((type::text), timestamp) to make that query fast.

Let me explain the reasoning behind the index order in detail. If type is first in the index, the index scan can start with the first index entry after ('NEAR_TRANSFER', <now - 1 day>) and scan all index entries until it hits the next type, so all the index entries that are found correspond to a result row. If the index order is the other way around, the scan has to start at the first entry after (<now - 1 day>, ...) and read all index entries up to the end of the index. It discards the index entries where type IS DISTINCT FROM 'NEAR_TRANSFER' and fetches the table rows for the remaining index entries. So this scan will fetch the same number of table rows, but has to read more index entries.
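
To make the difference concrete, here is a small Python sketch that models an index as a sorted list of tuples and counts how many index entries each column order has to read. It is only a toy model under stated assumptions: the data set, the extra type names TYPE_A and TYPE_B, the integer timestamps and the cutoff are all made up for illustration, and it ignores B-tree pages, heap fetches and everything else Postgres really does.

```python
import math
from bisect import bisect_right


def scan_type_first(rows, wanted_type, cutoff):
    """Toy range scan on an index ordered by (type, timestamp).

    Returns (index entries read, entries that match both conditions).
    """
    index = sorted((t, ts) for t, ts in rows)
    # First entry after (wanted_type, cutoff).
    start = bisect_right(index, (wanted_type, cutoff))
    # One past the last entry of wanted_type, i.e. where the next type begins.
    end = bisect_right(index, (wanted_type, math.inf))
    entries_read = end - start
    # Every entry in this contiguous slice satisfies both conditions.
    return entries_read, entries_read


def scan_timestamp_first(rows, wanted_type, cutoff):
    """Toy scan on an index ordered by (timestamp, type).

    Returns (index entries read, entries that match both conditions).
    """
    index = sorted((ts, t) for t, ts in rows)
    ts_keys = [ts for ts, _ in index]
    # First entry with timestamp > cutoff; from here the scan runs to the end.
    start = bisect_right(ts_keys, cutoff)
    entries_read = len(index) - start
    # Entries of other types are read and then discarded.
    matches = sum(1 for _, t in index[start:] if t == wanted_type)
    return entries_read, matches


# Hypothetical data: 9000 rows, three evenly distributed types,
# integer "timestamps" 0..8999, and a cutoff that keeps the newest 900.
TYPES = ["NEAR_TRANSFER", "TYPE_A", "TYPE_B"]
ROWS = [(TYPES[i % len(TYPES)], i) for i in range(9000)]
CUTOFF = 8099

print("(type, timestamp):", scan_type_first(ROWS, "NEAR_TRANSFER", CUTOFF))
print("(timestamp, type):", scan_timestamp_first(ROWS, "NEAR_TRANSFER", CUTOFF))
```

In this toy data set both orders find the same 300 matching entries, but the (timestamp, type) order reads 900 index entries to get them, while the (type, timestamp) order reads only the 300 it needs.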

It is an old and persistent myth that the most selective column should come first in the index. For the reason described above, the columns that are compared with = should come first. The selectivity of the columns is irrelevant.

All of this concerns a single query in isolation. You always have to consider the other queries in the workload as well, and for them the column order may make a difference.

CodePudding user response:

A single index on timestamp and type might be faster:

CREATE INDEX idx1 ON datTable (timestamp, type);

Or maybe:

CREATE INDEX idx1 ON datTable (type, timestamp);

Check the query plan to see whether the new index is used. You may have to drop one of the old indexes as well, and most likely you can drop it anyway.