Flexible search for users by full name in Elasticsearch

Time:05-19

I need to provide flexible search by full name with the following requirements:

  1. Possible to search by name
  2. Possible to search by last name
  3. Possible to search by name and last name and vice versa
  4. Possible to search by partial name or last name

The input is a single string, so I cannot tell whether it contains a first name or a last name. I therefore decided to use an edge n-gram analyzer that also supports searching for umlauts.
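One way to read these requirements is as a plain predicate: every whitespace-separated query term must be a prefix of some word in fullName, ignoring case and word order. The sketch below is only that interpretation written in Python (the function name `matches` is hypothetical and nothing here talks to Elasticsearch); it can serve as a reference when checking what a query should return.

```python
def matches(query: str, full_name: str) -> bool:
    """True if every query term is a prefix of some word in full_name.

    Case-insensitive and independent of word order. This is an assumed
    reading of the requirements, not Elasticsearch behaviour.
    """
    name_words = full_name.lower().split()
    return all(
        any(word.startswith(term) for word in name_words)
        for term in query.lower().split()
    )


names = ["Ruslan test", "Russell test", "Rust test"]

print([n for n in names if matches("ruslan", n)])       # ['Ruslan test']
print([n for n in names if matches("test ruslan", n)])  # ['Ruslan test']
print([n for n in names if matches("rus", n)])          # all three names
```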

I have the following index:

DELETE test.full.name

PUT test.full.name

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "analysis": {
                "filter": {
                    "edge_ngram_tokenizer": {
                        "token_chars": [
                            "letter",
                            "digit"
                        ],
                        "min_gram": "3",
                        "type": "edge_ngram",
                        "max_gram": "3"
                    }
                },
                "analyzer": {
                    "edge_ngram_multi_lang": {
                        "filter": [
                            "lowercase",
                            "german_normalization",
                            "edge_ngram_tokenizer"
                        ],
                        "tokenizer": "standard"
                    }
                }
            },
            "number_of_replicas": "1"
        }
    },
    "mappings": {
      "properties": {
        "fullName": {
          "type": "text",
          "analyzer": "edge_ngram_multi_lang"
        }
      }
  }
}

And create a few documents with data:

POST test.full.name/_doc
{
    "fullName": "Ruslan test"
}

POST test.full.name/_doc
{
    "fullName": "Russell test"
}

POST test.full.name/_doc
{
    "fullName": "Rust test"
}

Query search is:

GET test.full.name/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "fullName": {
                                        "query": "ruslan",
                                        "operator": "and"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

It returns all three documents, but it should return only the documents that contain "ruslan".

And the next search query:

GET test.full.name/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "fullName": {
                                        "query": "ruslan test",
                                        "operator": "and"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

It also returns all three documents, but I expected only the document with "ruslan test". It should also be possible to find users by full name with the query terms in any order, and partial search should work too: a search for "rus" should return every document whose fullName contains that value.

Likewise, the query "Ruslan test" should return the documents "test ruslan" and "ruslan test", and the same holds for the query "test ruslan".

So how should the index be configured to meet the requirements above?

Answer:

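Your analyzer applies the token filter you named edge_ngram_tokenizer, which, according to your index settings, emits edge n-grams with a minimum and maximum length of 3. Every word is therefore reduced to its first three characters, so "ruslan", "russell" and "rust" are all indexed as "rus", and the query "ruslan" is analyzed to "rus" as well. That is why all three documents match. You can verify this with the Analyze API, called on the index that defines the analyzer:

GET test.full.name/_analyze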
{
  "analyzer" : "edge_ngram_multi_lang",
  "text" : "Ruslan test"
}

The generated tokens are:

{
    "tokens": [
        {
            "token": "rus",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "tes",
            "start_offset": 7,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
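To make the collision concrete, here is a rough Python stand-in for the analyzer chain (standard tokenizer, lowercase, edge n-gram filter). It is a simplified sketch, not Elasticsearch code: the function name is hypothetical, word splitting uses a plain regex, and german_normalization is skipped because the sample names contain no umlauts, ß, or ae/oe/ue sequences.

```python
import re


def analyze_edge_ngram(text: str, min_gram: int = 3, max_gram: int = 3) -> list[str]:
    """Approximate: split into words, lowercase, keep leading n-grams only."""
    tokens = []
    for word in re.findall(r"\w+", text.lower()):
        for size in range(min_gram, max_gram + 1):
            # Words shorter than the gram size produce no token.
            if len(word) >= size:
                tokens.append(word[:size])
    return tokens


for name in ["Ruslan test", "Russell test", "Rust test"]:
    print(name, "->", analyze_edge_ngram(name))  # each one yields ['rus', 'tes']

# The query goes through the same analyzer, so "ruslan" is searched as "rus".
print(analyze_edge_ngram("ruslan"))  # ['rus']
```

With min_gram and max_gram both set to 3, the indexed terms carry no information beyond the first three letters, so no query operator can tell the three documents apart.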

Since this is not what you need, use the shingle token filter instead of the edge n-gram filter. It keeps whole words as tokens and adds multi-word combinations.


Here is a working example with the index mapping, search query, and search result.

Index Mapping:

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "analysis": {
                "filter": {
                    "my_shingle_filter": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 3
                    }
                },
                "analyzer": {
                    "edge_ngram_multi_lang": {
                        "filter": [
                            "lowercase",
                            "german_normalization",
                            "my_shingle_filter"
                        ],
                        "tokenizer": "standard"
                    }
                }
            },
            "number_of_replicas": "1"
        }
    },
    "mappings": {
        "properties": {
            "fullName": {
                "type": "text",
                "analyzer": "edge_ngram_multi_lang"
            }
        }
    }
}

The tokens generated now are:

{
    "tokens": [
        {
            "token": "ruslan",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "ruslan test",
            "start_offset": 0,
            "end_offset": 11,
            "type": "shingle",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "test",
            "start_offset": 7,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
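The same kind of stand-in can illustrate what the shingle filter contributes. This is a simplified Python sketch under assumptions (hypothetical function name, regex word splitting, german_normalization omitted, unigrams kept as in the filter's default), meant only to show which terms end up in the index for a name.

```python
import re


def analyze_shingles(text: str, min_size: int = 2, max_size: int = 3) -> list[str]:
    """Approximate: lowercase words plus word n-grams (shingles) of the given sizes."""
    words = re.findall(r"\w+", text.lower())
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)  # the single word itself is kept
        for size in range(min_size, max_size + 1):
            if i + size <= len(words):
                tokens.append(" ".join(words[i:i + size]))
    return tokens


print(analyze_shingles("Ruslan test"))   # ['ruslan', 'ruslan test', 'test']
print(analyze_shingles("Russell test"))  # ['russell', 'russell test', 'test']

# Whole words are preserved, so "ruslan" no longer collides with "russell".
print("ruslan" in analyze_shingles("Russell test"))  # False
```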

Search API:

{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "fullName": {
                                        "query": "test Ruslan",
                                        "operator": "and"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

Search Result:

"hits": [
            {
                "_index": "my-idx",
                "_id": "4",
                "_score": 0.9150312,
                "_source": {
                    "fullName": "test Ruslan"
                }
            },
            {
                "_index": "my-idx",
                "_id": "1",
                "_score": 0.88840073,
                "_source": {
                    "fullName": "Ruslan test"
                }
            }
        ]

Update 1:

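If partial search is also a requirement, consider the search_as_you_type field type.

Alternatively, you can keep the index mapping and settings from the answer above (since they already use shingles) and change the search query to a multi_match of type bool_prefix, which treats the last query term as a prefix. Note that the fullName._2gram and fullName._3gram subfields exist only when fullName is mapped as search_as_you_type; with a plain text field they are unmapped, so only fullName should contribute matches: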

{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "multi_match": {
                                    "query": "rusl",
                                    "type": "bool_prefix",
                                    "fields": [
                                        "fullName",
                                        "fullName._2gram",
                                        "fullName._3gram"
                                    ],
                                    "operator": "AND"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
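For reference, a minimal sketch of what a search_as_you_type mapping could look like, built as a Python dict and printed as the JSON body for a PUT request. It assumes the index is recreated with this mapping and leaves out the custom analyzer (lowercase, german_normalization), which would need to be added back for umlaut support. With max_shingle_size set to 3, the field gets the `_2gram` and `_3gram` subfields that the query references.

```python
import json

# Assumed request body for: PUT test.full.name
mapping_body = {
    "mappings": {
        "properties": {
            "fullName": {
                "type": "search_as_you_type",
                # 3 is the documented default; stated explicitly for clarity.
                "max_shingle_size": 3,
            }
        }
    }
}

print(json.dumps(mapping_body, indent=2))
```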

The above index mapping and setting can be used to achieve all of the test scenarios indicated in the question.
