What is the best way to use Elasticsearch to search for exact partial text in a string?
In SQL the equivalent would be LIKE '%PARTIAL TEXT%' or LIKE '%ARTIAL TEX%'.
In Elasticsearch, the current method being used is:
{
"query": {
"match_phrase_prefix": {
"name": "PARTIAL TEXT"
}
}
}
However, it breaks whenever you remove the first and last characters of the string, as shown below (no results found):
{
"query": {
"match_phrase_prefix": {
"name": "ARTIAL TEX"
}
}
}
CodePudding user response:
There will likely be numerous suggestions for how to solve this, such as using an ngram analyzer. I believe the simplest would be to use fuzziness.
{
"query": {
"match": {
"name": {
"query": "artial tex",
"operator": "and",
"fuzziness": 1
}
}
}
}
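One caveat: fuzziness is Levenshtein edit distance per term, capped at 2, so this tolerates a character or two of difference rather than arbitrary substrings. If you don't want to hard-code the distance, a variant of the same query that lets Elasticsearch pick the edit distance from the term length would look like this:
{
  "query": {
    "match": {
      "name": {
        "query": "artial tex",
        "operator": "and",
        "fuzziness": "AUTO"
      }
    }
  }
}
With "AUTO", "artial" (6 characters) is allowed 2 edits and "tex" (3 characters) is allowed 1 edit, so both still match "Partial text".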
CodePudding user response:
There are multiple ways to do a partial search, and each comes with its own tradeoffs.
1. Wildcard
For wildcard queries, perform the search on the "keyword" sub-field instead of the "text" field.
{
"query": {
"wildcard": {
"name.keyword": "*artial tex*"
}
}
}
Wildcards have poor performance; there are better alternatives.
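Also note that the keyword sub-field is not analyzed, so wildcard matching is case-sensitive by default. A minimal sketch of the longer form, assuming Elasticsearch 7.10 or later where the case_insensitive option is available:
{
  "query": {
    "wildcard": {
      "name.keyword": {
        "value": "*artial tex*",
        "case_insensitive": true
      }
    }
  }
}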
2. Match/Match_phrase/Match_phrase_prefix
If you are searching for whole tokens like "PARTIAL TEXT", you can simply use a match query; all documents that contain the tokens "partial" and "text" will be returned.
If the order of the tokens matters, use match_phrase.
If you want to search for partial tokens, use match_phrase_prefix. The prefix match is only applied to the last token of the search input, e.g. "partial tex".
None of these is suitable for your use case, since you want to match anywhere inside the text; a sketch of the first two follows below.
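A minimal sketch of the match and match_phrase forms over the same name field (match_phrase_prefix is already shown in the question); both match whole tokens only, which is why they cannot find "artial tex":
{
  "query": {
    "match": {
      "name": {
        "query": "partial text",
        "operator": "and"
      }
    }
  }
}
{
  "query": {
    "match_phrase": {
      "name": "partial text"
    }
  }
}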
3. Ngram tokenizer
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.
Index settings:
PUT index29
{
"settings": {
"max_ngram_diff" : "5",
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 5,
"max_gram": 7
}
}
}
}
}
POST index29/_analyze
{
"analyzer": "my_analyzer",
"text": "Partial text"
}
Tokens Generated:
"tokens" : [
{
"token" : "Parti",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Partia",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "Partial",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 2
},
{
"token" : "artia",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 3
},
{
"token" : "artial",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 4
},
{
"token" : "artial ",
"start_offset" : 1,
"end_offset" : 8,
"type" : "word",
"position" : 5
},
{
"token" : "rtial",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 6
},
{
"token" : "rtial ",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 7
},
{
"token" : "rtial t",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 8
},
{
"token" : "tial ",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 9
},
{
"token" : "tial t",
"start_offset" : 3,
"end_offset" : 9,
"type" : "word",
"position" : 10
},
{
"token" : "tial te",
"start_offset" : 3,
"end_offset" : 10,
"type" : "word",
"position" : 11
},
{
"token" : "ial t",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 12
},
{
"token" : "ial te",
"start_offset" : 4,
"end_offset" : 10,
"type" : "word",
"position" : 13
},
{
"token" : "ial tex",
"start_offset" : 4,
"end_offset" : 11,
"type" : "word",
"position" : 14
},
{
"token" : "al te",
"start_offset" : 5,
"end_offset" : 10,
"type" : "word",
"position" : 15
},
{
"token" : "al tex",
"start_offset" : 5,
"end_offset" : 11,
"type" : "word",
"position" : 16
},
{
"token" : "al text",
"start_offset" : 5,
"end_offset" : 12,
"type" : "word",
"position" : 17
},
{
"token" : "l tex",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 18
},
{
"token" : "l text",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 19
},
{
"token" : " text",
"start_offset" : 7,
"end_offset" : 12,
"type" : "word",
"position" : 20
}
]
You can search on any of the generated tokens. You can also set "token_chars": [ "letter", "digit" ] on the tokenizer to generate tokens that exclude spaces.
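To actually search the field (rather than just analyze text), the analyzer also has to be attached to the field in the mapping. A minimal sketch, assuming the same index29 index, the settings shown above, and a text field called name as in the question:
PUT index29
{
  "settings": {
    "max_ngram_diff": "5",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 5,
          "max_gram": 7
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
POST index29/_doc?refresh
{
  "name": "Partial text"
}
GET index29/_search
{
  "query": {
    "match": {
      "name": {
        "query": "artial tex",
        "operator": "and"
      }
    }
  }
}
Because the same ngram analyzer runs at both index and search time, every 5-7 character gram produced from "artial tex" also exists among the indexed grams of "Partial text", so the and-operator match succeeds. Note that the search string must be at least min_gram (5) characters long to produce any tokens, and matching is case-sensitive unless you add a lowercase filter to the analyzer.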
Your choice among the options above will depend on your data size and performance requirements. Wildcard is more flexible, but matching is done at run time, so performance is slow. If your data size is small, this will be an ideal solution.
With ngrams, tokens are generated at indexing time. This takes more memory, but search is faster. For a large data size this should be the ideal solution.