We are building a lightweight text search on top of our data in Postgres with GIN indexes. When the number of matches is small, it is really fast. However, when we search for common terms, performance degrades significantly because of the large number of matches.
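The index is a GIN expression index along these lines (the name and expression are taken from the plans below; the exact definition may differ slightly):

CREATE INDEX data_change_records_content_to_tsvector_idx
    ON data_change_records
    USING gin (to_tsvector('english', content));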
Consider the following query:
EXPLAIN ANALYZE
SELECT count(id)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'mustafa');
The result is as follows:
Finalize Aggregate  (cost=47207.99..47208.00 rows=1 width=8) (actual time=15.461..17.129 rows=1 loops=1)
  ->  Gather  (cost=47207.78..47207.99 rows=2 width=8) (actual time=9.734..17.119 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=46207.78..46207.79 rows=1 width=8) (actual time=3.773..3.774 rows=1 loops=3)
              ->  Parallel Bitmap Heap Scan on data_change_records d  (cost=1759.41..46194.95 rows=5130 width=37) (actual time=1.765..3.673 rows=1143 loops=3)
                    Recheck Cond: (to_tsvector('english'::regconfig, content) @@ '''mustafa'''::tsquery)
                    Heap Blocks: exact=2300
                    ->  Bitmap Index Scan on data_change_records_content_to_tsvector_idx  (cost=0.00..1756.33 rows=12311 width=0) (actual time=4.219..4.219 rows=3738 loops=1)
                          Index Cond: (to_tsvector('english'::regconfig, content) @@ '''mustafa'''::tsquery)
Planning Time: 0.141 ms
Execution Time: 17.163 ms
If the search term is a common one, for example mustafa replaced with aws, which the English stemmer reduces to aw, the analysis is as follows:
Finalize Aggregate  (cost=723889.39..723889.40 rows=1 width=8) (actual time=1073.513..1086.414 rows=1 loops=1)
  ->  Gather  (cost=723889.17..723889.38 rows=2 width=8) (actual time=1069.439..1086.401 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=722889.17..722889.18 rows=1 width=8) (actual time=1063.847..1063.848 rows=1 loops=3)
              ->  Parallel Bitmap Heap Scan on data_change_records d  (cost=17128.34..721138.59 rows=700233 width=37) (actual time=389.347..1014.440 rows=542724 loops=3)
                    Recheck Cond: (to_tsvector('english'::regconfig, content) @@ '''aw'''::tsquery)
                    Heap Blocks: exact=167605
                    ->  Bitmap Index Scan on data_change_records_content_to_tsvector_idx  (cost=0.00..16708.20 rows=1680560 width=0) (actual time=282.517..282.518 rows=1647916 loops=1)
                          Index Cond: (to_tsvector('english'::regconfig, content) @@ '''aw'''::tsquery)
Planning Time: 0.150 ms
Execution Time: 1086.455 ms
At this point, we are not sure how to proceed. One option would be to change the tokenization so that two-letter tokens are not indexed, but the real cause is that we have lots of aws indexed: if we search for ok, which is also a two-letter token but much less common, the query returns in 61.378 ms.
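Term frequencies like this can be verified with ts_stat (a one-off check only, since it scans the whole table):

SELECT word, ndoc
FROM ts_stat('SELECT to_tsvector(''english'', content) FROM data_change_records')
ORDER BY ndoc DESC
LIMIT 10;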
CodePudding user response:
Searching for frequent words can never be as fast as searching for rare words.
One thing that strikes me is that you are using English stemming to search for names. If that is really your use case, you should use the simple dictionary, which wouldn't stem aws to aw and could result in fewer rows eliminated by the recheck.
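The difference is easy to see directly (the english configuration stems the token, simple keeps it as-is):

SELECT to_tsvector('english', 'aws');  -- 'aw':1
SELECT to_tsvector('simple', 'aws');   -- 'aws':1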
Alternatively, you could introduce an additional synonym dictionary into a custom text search configuration; the synonym dictionary would contain aws and prevent it from being stemmed.
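A sketch of such a configuration (the dictionary and configuration names and the synonym file aws_names.syn, which must be placed in the tsearch_data directory, are placeholders):

-- aws_names.syn contains the line:  aws aws
CREATE TEXT SEARCH DICTIONARY aws_syn (TEMPLATE = synonym, SYNONYMS = aws_names);
CREATE TEXT SEARCH CONFIGURATION english_aws (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_aws
    ALTER MAPPING FOR asciiword WITH aws_syn, english_stem;

The index and the queries would then use to_tsvector('english_aws', content) instead of the english configuration.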
But, as I said, searching for frequent words cannot be fast if you want all result rows. A trick you could use is to set gin_fuzzy_search_limit to the maximum number of hits you want to find; then the index scan will stop early and may be faster (but you won't get all results).
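For example (the value 10000 is just an illustration; the parameter can be set per session):

SET gin_fuzzy_search_limit = 10000;

SELECT count(id)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'aws');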
CodePudding user response:
If you have a new-enough version of PostgreSQL and your table is well-vacuumed, you can get a bitmap-only scan, which doesn't need to visit the table, just the index. But you would need to use count(*), not count(id), to get that. If "id" is never NULL, then the two give identical answers.
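For example (the explicit VACUUM is only needed if autovacuum hasn't kept the visibility map up to date):

VACUUM data_change_records;

SELECT count(*)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'aws');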
The query plan does not make it easy to tell when the bitmap-only optimization kicks in or how effective it is. If you use EXPLAIN (ANALYZE, BUFFERS), you should get at least some clue from the buffer counts.
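For example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'aws');

Lower shared buffer counts on the heap scan node, compared with the plans above, would indicate that heap fetches were skipped.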