As far as I'm aware, there is no simple, quick solution. I am trying to do a full-text keyword or semantic search, which is a very advanced topic. There are dedicated search servers created specifically for that reason, but still is there a way that I can implement for a query execution time for less than a second?
Here's what I have tried so far:
begin;
SET pg_trgm.similarity_threshold = 0.3;
select
id, <col_name>
similarity(<column with gin index>,
'<text to be searched>') as sml
from
<table> p
where
<clauses> and
<indexed_col> % '<text to be searched>'
and indexed_col <-> '<text to be searched>' < 0.5
order by
indexed_col <-> '<text to be searched>'
limit 10;
end;
Index created is as follows:
CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);
The above query takes around 6-7 secs to execute and sometimes even 200 ms which is weird to me because it changes the query plan according to the input I pass in for similarity.
I tried ts_vector @@ ts_query, but they turn out to be too strict due to &
operator.
EDIT: Here's the EXPLAIN ANALYZE of the above query
-> Sort (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on posts p (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Rows Removed by Index Recheck: 36258
Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
Heap Blocks: exact=24043
-> Bitmap Index Scan on trgm_idx (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms```
CodePudding user response:
Your redundant/overlapping query conditions aren't helpful. Setting similarity_threshold=0.3 then doing
t % q and t <-> q < 0.5
just throws away index selectivity for no reason. Set similarity_threshold to as stringent of a value as you want to use, then get rid of the unnecessary <->
condition.
You could try the GiST version of trigram indexing. I can support the ORDER BY ... <-> ... LIMIT 10
operation directly from the index. I doubt it will be very effective with 2000 char strings, but it is worth a try.