Which PostgreSQL index is most efficient for a text column with similarity-based queries


I would like to create an index on a text column for the following use case. We have a segments table with a content column of type text, and we run similarity queries against it using pg_trgm. This is used in a translation editor for finding similar strings. Here are the table details:

CREATE TABLE public.segments
(
  id integer NOT NULL DEFAULT nextval('segments_id_seq'::regclass),
  language_id integer NOT NULL,
  content text NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT segments_pkey PRIMARY KEY (id),
  CONSTRAINT segments_language_id_fkey FOREIGN KEY (language_id)
      REFERENCES public.languages (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE,
  CONSTRAINT segments_content_language_id_key UNIQUE (content, language_id)
);

And here is the query (Ruby/Hanami):

# sim_score is a percentage (e.g. 45), so it is divided by 100.00 to match
# the 0..1 range returned by similarity().
def find_by_segment_match(source_text_for_lookup, source_lang, sim_score)
    aggregate(:translation_records)
      .where(language_id: source_lang)
      .where { similarity(:content, source_text_for_lookup) > sim_score / 100.00 }
      .select_append { float::similarity(:content, source_text_for_lookup).as(:similarity) }
      .order { similarity(:content, source_text_for_lookup).desc }
end

---EDIT---

These are the generated SQL queries:

SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity" FROM "segments" WHERE (("language_id" = 2) AND (similarity("content", 'This will not work.') > 0.45)) ORDER BY SIMILARITY("content", 'This will not work.') DESC

SELECT "translation_records"."id", "translation_records"."source_segment_id", "translation_records"."target_segment_id", "translation_records"."domain_id",
"translation_records"."style_id",
"translation_records"."created_by", "translation_records"."updated_by", "translation_records"."project_name", "translation_records"."created_at", "translation_records"."updated_at", "translation_records"."language_combination", "translation_records"."uid",
"translation_records"."import_comment" FROM "translation_records" INNER JOIN "segments" ON ("segments"."id" = "translation_records"."source_segment_id") WHERE ("translation_records"."source_segment_id" IN (27548)) ORDER BY "translation_records"."id"

---END EDIT---

---EDIT 1---

What about re-indexing? Initially we'll import about 2 million legacy records. When and how often, if at all, should we rebuild the index?

---END EDIT 1---

Would something like CREATE INDEX ON segments USING gist (content) be OK? I can't really tell which of the available index types would be best suited for our use case.

Best, seba

CodePudding user response:

CREATE INDEX segment_language_id_idx ON segments USING btree (language_id);
CREATE INDEX segment_content_gin ON segments USING gin (content gin_trgm_ops);
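
A note on prerequisites: gin_trgm_ops is provided by the pg_trgm extension, so the extension needs to exist in the database before that index can be created. A minimal sketch, assuming sufficient privileges:

-- pg_trgm provides the trigram operator classes plus the % and <-> operators;
-- it must be installed once per database before gin_trgm_ops can be used
CREATE EXTENSION IF NOT EXISTS pg_trgm;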

CodePudding user response:

The 2nd query you show seems to be unrelated to this question.

Your first query can't use a trigram index, as the query would have to be written in operator form, not function form, to do that.

In operator form, it would look like this:

SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity" 
FROM segments 
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';

In order for % to be equivalent to similarity("content", 'This will not work.') > 0.45, you would first need to run SET pg_trgm.similarity_threshold TO 0.45;
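
For illustration, a minimal session sketch; the 0.45 threshold and the sample text come from the question, and the EXPLAIN is only there to check whether the planner actually picks a trigram index for the operator form:

-- pg_trgm.similarity_threshold is a session-level setting; it resets when the connection ends
SET pg_trgm.similarity_threshold TO 0.45;

EXPLAIN ANALYZE
SELECT id, content, similarity(content, 'This will not work.') AS similarity
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';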

Now how you get ruby/hanami to generate this form, I don't know.

The % operator can be supported by either the gin_trgm_ops or the gist_trgm_ops operator class. The <-> operator can only be supported by gist_trgm_ops. But it is pretty hard to predict how efficient that support will be. If your "content" column is long or your text to compare against is long, it is unlikely to be very efficient, especially in the case of gist.
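
As a sketch of the two options (the index names here are just placeholders, and you would normally create only one of them):

-- GIN: supports the % filter, but not the <-> ordering
CREATE INDEX segments_content_trgm_gin ON segments USING gin (content gin_trgm_ops);

-- GiST: supports both % and the <-> distance ordering used in the ORDER BY
CREATE INDEX segments_content_trgm_gist ON segments USING gist (content gist_trgm_ops);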

Ideally you would partition your table by language_id. If not, then it might be helpful to build a multicolumn index having both columns.
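
If partitioning is not an option, one possible way to cover both columns with a single index is the btree_gin extension, which provides GIN operator classes for plain scalar types such as integer. A sketch under that assumption, with a made-up index name:

-- btree_gin lets language_id (an integer) live in the same GIN index as the trigram column
CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE INDEX segments_lang_content_trgm
    ON segments USING gin (language_id, content gin_trgm_ops);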
