Home > Software engineering >  Optimizing psql query that's using regex on text search
Optimizing psql query that's using regex on text search

Time:12-17

In psql I have the following query.

Suggestions on how to speed it up/optimize it?

I've tried various indexes on title and headline but they aren't getting used.

"SELECT \"people\".* FROM \"people\" WHERE (((TITLE IS NOT NULL AND title ~* '(^| )(one|two|three)( |$|,)' AND title !~* '(^| )(four|five|six)( |$|,)') OR (TITLE IS NULL AND headline ~* '(^| )(one|two|three)( |$|,)' AND headline !~* '(^| )(four|five|six)( |$|,)')) AND ((TITLE IS NOT NULL AND title ~* '(^| )(seven|eight|nine)( |$|,)' AND title !~* '(^| )(ten|eleven)( |$|,)') OR (TITLE IS NULL AND headline ~* '(^| )(seven|eight|nine)( |$|,)' AND headline !~* '(^| )(ten|eleven)( |$|,)')))"

Here's the EXPLAIN:

 Gather  (cost=1000.00..286343.58 rows=61760 width=715)                                                                                                                                                                               
   Workers Planned: 2                                                                    
   ->  Parallel Seq Scan on people  (cost=0.00..279167.58 rows=25733 width=715)                                                                                                                                                       
         Filter: ((((title IS NOT NULL) AND ((title)::text ~* '(^| )(one|two|three)( |$|,)'::text) AND ((title)::text !~* '(^| )(four|five|six)( |$|,)'::text)) OR ((title IS NULL) AND ((headline)::text ~* '(^| )(one|two|three)( |$|,)'::text) AND ((headline)::text !~* '(^| )(four|five|six)( |$|,)'::text))) AND (((title IS NOT NULL) AND ((title)::text ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND ((title)::text !~* '(^| )(ten|eleven)( |$|,)'::text)) OR ((title IS NULL) AND ((headline)::text ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND ((headline)::text !~* '(^| )(ten|eleven)( |$|,)'::text))))
 JIT:                                                                     
   Functions: 2                                         
   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                                                                                                      
(7 rows)     

CodePudding user response:

Traditional relational databases won't use an index on a column unless the leading part of the column is specified in the condition, ie:

... where my_column like 'FOO%' -- will (usually) use index
... where my_column like '%FOO%' -- will (usually) not use index

To efficiently search for terms within content you need a text-based search technology.

Fortunately, postgres provides support for full text search, which will give you great performance, and convenient syntax, for your task.

CodePudding user response:

If I create a trigram index on the right columns, it will support this directly.

Much to my surprise, it was actually pretty efficient as well. That is not always a given, some regexp cannot be decomposed to an efficient set of trigrams.

create extension pg_trgm;
create index on people using gin (title gin_trgm_ops, headline gin_trgm_ops);

Gives this sub-millisecond plan for million row table:

 Bitmap Heap Scan on people  (cost=547.25..551.28 rows=1 width=12) (actual time=0.741..0.743 rows=1 loops=1)
   Recheck Cond: (((title ~* '(^| )(one|two|three)( |$|,)'::text) OR (headline ~* '(^| )(one|two|three)( |$|,)'::text)) AND ((title ~* '(^| )(seven|eight|nine)( |$|,)'::text) OR (headline ~* '(^| )(seven|eight|nine)( |$|,)'::text)))
   Filter: ((((title IS NOT NULL) AND (title ~* '(^| )(one|two|three)( |$|,)'::text) AND (title !~* '(^| )(four|five|six)( |$|,)'::text)) OR ((title IS NULL) AND (headline ~* '(^| )(one|two|three)( |$|,)'::text) AND (headline !~* '(^| )(four|five|six)( |$|,)'::text))) AND (((title IS NOT NULL) AND (title ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND (title !~* '(^| )(ten|eleven)( |$|,)'::text)) OR ((title IS NULL) AND (headline ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND (headline !~* '(^| )(ten|eleven)( |$|,)'::text))))
   Rows Removed by Filter: 2
   Heap Blocks: exact=1
   ->  BitmapAnd  (cost=547.25..547.25 rows=1 width=0) (actual time=0.701..0.702 rows=0 loops=1)
         ->  BitmapOr  (cost=241.50..241.50 rows=200 width=0) (actual time=0.395..0.395 rows=0 loops=1)
               ->  Bitmap Index Scan on people_title_headline_idx  (cost=0.00..120.75 rows=100 width=0) (actual time=0.208..0.208 rows=80 loops=1)
                     Index Cond: (title ~* '(^| )(one|two|three)( |$|,)'::text)
               ->  Bitmap Index Scan on people_title_headline_idx  (cost=0.00..120.75 rows=100 width=0) (actual time=0.186..0.186 rows=60 loops=1)
                     Index Cond: (headline ~* '(^| )(one|two|three)( |$|,)'::text)
         ->  BitmapOr  (cost=305.50..305.50 rows=200 width=0) (actual time=0.301..0.301 rows=0 loops=1)
               ->  Bitmap Index Scan on people_title_headline_idx  (cost=0.00..152.75 rows=100 width=0) (actual time=0.145..0.145 rows=3 loops=1)
                     Index Cond: (title ~* '(^| )(seven|eight|nine)( |$|,)'::text)
               ->  Bitmap Index Scan on people_title_headline_idx  (cost=0.00..152.75 rows=100 width=0) (actual time=0.156..0.156 rows=2 loops=1)
                     Index Cond: (headline ~* '(^| )(seven|eight|nine)( |$|,)'::text)

While without the index it takes 500 ms.

But if every row matches on the positive expressions, but is then ruled out by also matching the negative ones (!~*) then no index will help you.

  • Related