Is there a similar approach as Postgres's ts_vector field for ElasticSearch?

Time:02-18

When using Postgres, you can index a string in a database field as a vector using the tsvector type (https://www.postgresql.org/docs/10/datatype-textsearch.html#DATATYPE-TSVECTOR).

Is there a similar concept for ElasticSearch?

CodePudding user response:

It's pretty much what ES does under the hood when you index a string into a text field.

Let's take the first example from the link you provided: a fat cat sat on a mat and ate a fat rat

With the PG tsvector type, the following tokens are going to be stored (sorted, with duplicates removed):

a and ate cat fat mat on rat sat
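As a rough illustration in plain Python (not PostgreSQL itself), a bare cast to tsvector behaves like splitting on whitespace, dropping duplicates and sorting. The helper name `naive_tsvector` is made up for this sketch. Note that the bare cast does no stemming or stop-word removal; in PostgreSQL that normalization is done by to_tsvector.

```python
def naive_tsvector(text):
    # Hypothetical helper: split on whitespace, drop duplicates, sort.
    # This mimics only the bare ::tsvector cast, not to_tsvector().
    return sorted(set(text.split()))


tokens = naive_tsvector("a fat cat sat on a mat and ate a fat rat")
print(" ".join(tokens))
```

The printed tokens match the list above: each word appears once, in alphabetical order.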

If you want to keep positions, you need to specify them, like this:

a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12
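To make the positional form concrete, it can be thought of as a map from each token to the positions where it occurs. The sketch below is plain Python with a hypothetical helper name, `positions_by_token`; it only parses the `token:position` text above and groups the positions per token, which is close to how PostgreSQL displays the resulting value.

```python
from collections import defaultdict


def positions_by_token(positional_text):
    # Hypothetical helper: group the positions of each token.
    grouped = defaultdict(list)
    for item in positional_text.split():
        token, _, position = item.partition(":")
        grouped[token].append(int(position))
    # Sort by token, mirroring the sorted order a tsvector uses.
    return {token: sorted(grouped[token]) for token in sorted(grouped)}


text = "a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12"
result = positions_by_token(text)
for token, positions in result.items():
    print(token, positions)
```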

Whereas with ES, positions are kept automatically without having to specify them. It is also possible to tell ES not to record them (to save space).
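As a sketch of how positions can be switched off, the `index_options` mapping parameter of a text field can be set to `docs`, so that only document numbers are recorded. The field name `content` is an assumption made up for this example, and the snippet only builds the JSON body of a create-index request; it does not talk to a cluster. Keep in mind that with `docs` neither positions nor term frequencies are recorded, so position-dependent queries such as phrase matching will not work on that field.

```python
import json

# Hypothetical mapping: "content" is an assumed field name.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                # "positions" is the default for text fields;
                # "docs" keeps only the document numbers.
                "index_options": "docs",
            }
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)
```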

With the ES text type and the standard analyzer, the following tokens are going to be analyzed and indexed:

a fat cat sat on a mat and ate a fat rat
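A simplified way to picture this, in plain Python: lowercase the text, split it into word tokens, and keep every token with its position, duplicates included. This is only an approximation under stated assumptions (the real standard analyzer follows Unicode text segmentation rules, which a `\w+` regular expression does not fully reproduce), and `naive_standard_analyze` is a made-up name.

```python
import re


def naive_standard_analyze(text):
    # Hypothetical approximation: lowercase, then split into word tokens.
    # Duplicates and the original order are preserved.
    words = re.findall(r"\w+", text.lower())
    return list(enumerate(words))  # (position, token), counting from 0


analyzed = naive_standard_analyze("a fat cat sat on a mat and ate a fat rat")
for position, token in analyzed:
    print(position, token)
```

Unlike the tsvector cast, nothing is sorted or de-duplicated here, and the positions come for free.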

With the english analyzer, we get this (stopwords removed, words stemmed, etc.):

fat cat sat mat at fat rat

ES doesn't sort the tokens of a document alphabetically, since that doesn't really help with free-text search. It also doesn't remove duplicates (although it is possible to do so), because that would interfere with the term frequency in the document and in the index, and hence with the scoring.
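A small Python sketch shows why duplicates matter for scoring. It simply counts the tokens produced by the english analyzer above, first as they are and then after de-duplication; it does not reproduce the actual scoring formula ES uses.

```python
from collections import Counter

tokens = "fat cat sat mat at fat rat".split()

# With duplicates kept, "fat" counts twice in this document.
term_freq = Counter(tokens)

# After de-duplication, every token counts once and that signal is lost.
deduped_freq = Counter(set(tokens))

print(term_freq["fat"], deduped_freq["fat"])
```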

Basically, both index pretty much the same tokens, although ES is a search engine at heart and is better optimized for this. Compared with the tsquery type, free-text searches in ES are also a bit more user-friendly.
