Elasticsearch: Ngram tokenizer performance


I'm trying to search documents in Elasticsearch with a simple query like this:

{
    "query": {
        "match": { "name": "Test name" }
    }
}

and I have about 70 million documents in the index. I used the whitespace tokenizer before, and it worked fine. But now that I've started using ngram, even this query runs for at least 6-7 seconds. I create the index like this:

{
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "custom_analyzer"
            }
        }
    },
    "settings": {
        "analysis": {
            "tokenizer": {
                "custom_tokenizer": {
                    "token_chars": [
                        "letter",
                        "digit",
                        "symbol",
                        "punctuation"
                    ],
                    "min_gram": "2",
                    "type": "ngram",
                    "max_gram": "3"
                }
            },
            "analyzer": {
                "custom_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "custom_tokenizer"
                }
            }
        }
    }
}
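
To see what the analyzer actually does, I can call the _analyze API on the index (the endpoint is GET /<index>/_analyze; the field values below are just an example) with a body like this:

{
    "analyzer": "custom_analyzer",
    "text": "Test name"
}

Even this short two-word value comes back as roughly ten tokens (the 2- and 3-grams of each word), so the index ends up much larger than with the whitespace tokenizer.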

Are there any ways to optimize the search? Or is ngram really that slow?

CodePudding user response:

Yes, ngram is known to cause performance issues: it creates many more tokens, which increases both the Elasticsearch index size and the number of search terms to match. One way to improve performance is to use ngram only in the queries where you really need it, for example for infix (contains) searches. If you share your search use case, the community might be able to suggest better alternatives.
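
One common pattern (just a sketch; the "ngram" subfield name is an arbitrary choice) is to index the field twice with multi-fields: the main name field keeps the default standard analyzer for ordinary match queries, and a name.ngram subfield carries your ngram analyzer for the cases that actually need partial matching:

{
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {
                    "ngram": {
                        "type": "text",
                        "analyzer": "custom_analyzer"
                    }
                }
            }
        }
    }
}

Your existing full-word query, "match": { "name": "Test name" }, then runs against the standard-analyzed field with far fewer terms, while a contains-style search can target the subfield explicitly, e.g. "match": { "name.ngram": "est" }. Some setups also put a non-ngram search_analyzer on the ngram field so the query string itself is not split into grams, but that only helps when query terms are no longer than max_gram.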
