Documents contain a url field with a full url. Users should be able to search for documents containing a given url by supplying a portion of the url string. The search string can be 3-15 characters long. An N-gram token filter with min_gram of 3 and max_gram of 15 would work but generates a large number of tokens for long urls. Is it possible to have ElasticSearch only generate tokens for the first 100 characters of the url field?
For example, the user should be able to search for documents containing the following url using a search string such as 'example.com' or '/foo/bar'.
CodePudding user response:
There are two ways to achieve what you want.
Option 1: Keep using ngrams as you do now, but insert a truncate
token filter before the ngram one, so the URL is cut down to its first 100 characters before the ngrams are generated (see the sketch below).
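A minimal sketch of such an index, assuming an index named test and illustrative analyzer/filter names (url_index_analyzer, url_search_analyzer, url_truncate, url_ngrams); note that index.max_ngram_diff has to be raised, because max_gram - min_gram here is larger than the default of 1:
PUT test
{
  "settings": {
    "index": {
      "max_ngram_diff": 12
    },
    "analysis": {
      "filter": {
        "url_truncate": {
          "type": "truncate",
          "length": 100
        },
        "url_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      },
      "analyzer": {
        "url_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "url_truncate", "url_ngrams"]
        },
        "url_search_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "url_index_analyzer",
        "search_analyzer": "url_search_analyzer"
      }
    }
  }
}
With that in place, a plain match query should find any 3-15 character fragment of the first 100 characters of the url:
POST test/_search
{
  "query": {
    "match": {
      "url": "/foo/bar"
    }
  }
}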
Option 2: Use the wildcard
field type, which has been created exactly for cases like this. In your index, you should first change the type of the url field to wildcard:
PUT test
{
  "mappings": {
    "properties": {
      "url": {
        "type": "wildcard"
      }
    }
  }
}
Then, you can search on that field using the wildcard query, like this:
POST test/_search
{
  "query": {
    "wildcard": {
      "url": "*foo/bar*"
    }
  }
}
Also, read the related blog post, which shows in detail how the wildcard
field type performs.