Is there a way to get ElasticSearch to create n-gram tokens from truncated field?


Documents contain a url field with a full url. Users should be able to search for documents containing a given url by supplying a portion of the url string. The search string can be 3-15 characters long. An N-gram token filter with min_gram of 3 and max_gram of 15 would work but generates a large number of tokens for long urls. Is it possible to have ElasticSearch only generate tokens for the first 100 characters of the url field?

For example, the user should be able to search for documents containing the following url using a search string such as 'example.com' or '/foo/bar'.

https://click.example.com/foo/bar/55gft/?qs=1952934d0ee8e2368ec7f7a921e3c6202b39365b9a2d26774c8122b8555ca21fce9d2344fc08a8ba40caede5e6901a112c6e89ead40892109eb8290d70571eab

CodePudding user response:

There are two ways to achieve what you want.

Option 1: Keep using ngrams as you do now, but insert a truncate token filter before the ngram one, so the URL is first limited to 100 characters and only the truncated value is ngrammed (see the sketch below).
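A minimal sketch of that analysis chain, assuming an index named test and illustrative filter/analyzer names (url_truncate, url_ngram, url_index_analyzer, url_search_analyzer); the truncate length and gram sizes come from the question, and index.max_ngram_diff has to be raised because max_gram minus min_gram exceeds the default of 1:

PUT test
{
  "settings": {
    "index.max_ngram_diff": 12,
    "analysis": {
      "filter": {
        "url_truncate": {
          "type": "truncate",
          "length": 100
        },
        "url_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      },
      "analyzer": {
        "url_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "url_truncate", "url_ngram"]
        },
        "url_search_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "url_index_analyzer",
        "search_analyzer": "url_search_analyzer"
      }
    }
  }
}

Because the search analyzer does not ngram the query, the search string is matched as a whole against the indexed grams, so a plain match query should work for search strings of 3-15 characters:

POST test/_search
{
  "query": {
    "match": {
      "url": "example.com"
    }
  }
}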

Option 2: Use the wildcard field type, which has been created exactly for cases like this.

In your index, you should first change the type of the URL field to wildcard:

PUT test 
{
  "mappings": {
    "properties": {
      "url": {
        "type": "wildcard"
      }
    }
  }
}

Then, you can search on that field, using the wildcard query, like this:

POST test/_search 
{
  "query": {
    "wildcard": {
      "url": "*foo/bar*"
    }
  }
}

Also, read the related blog post, which shows in detail how the wildcard field type performs.
