How can I improve/make stronger text fuzzy searching in Elasticsearch?

Below is my setup. I am inserting users into Elasticsearch and running weighted fuzzy searches on the username. The problem is that the fuzziness could be... fuzzier? To show what I mean, this is my mapping:

{
    "mappings": {
        "properties": {
            "user_id": {
                "enabled": false
            },
            "username": {
                "type": "text"
            },
            "d_likes": {
                "type": "rank_feature"
            }
        }
    }
}

And I am inserting 2 users:

user_id: random, username: pietje, d_likes: 3
user_id: random, username: p13tje, d_likes: 30

Now the problem is that I need to write a lot of characters in the username field to get hits. This is how I search:

{
  "query": {
    "bool": {
      "must": [
        {
            "match": {
              "username": {
                "query": "piet",
                "fuzziness": "auto"
              }
            }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "d_likes"
          }
        }
      ]
    }
  }
}

'piet' gives no results. That looks strange to me, I was hoping I would actually see both p13tje and pietje (in that order) because they are so similar. When my search query is pietj, I only get pietje and not p13tje.

So I was wondering: how can I get more hits with the fuzzy search? I want autocompletion for usernames, and this is a pretty bad user experience because it only gives suggestions once you have typed most of the characters. I just want the search to be looser and give more results.

CodePudding user response：

From the Elasticsearch documentation:

When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.

The Levenshtein Edit Distance essentially is a way of measuring the difference between 2 string values.

You've set the fuzziness parameter to AUTO, which is a great default decision. However, for some short strings like yours, it can prove to be not as fuzzy as you'd want it to be.

This is because Elasticsearch (ES) derives the allowed edit distance from the length of the query term, which determines how many edits away a term in the index may be and still match.

You haven't specified low or high values, so the defaults (AUTO:3,6) apply: piet is a 4 character string, so only one edit is allowed.

pietje is actually two edits away - piet needs a j as well as an e so it won't show up.

p13tje is actually four edits away - it needs a j, an e, a change from 1 to i & a change from 3 to e so it also won't show up.
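If you want to check these distances yourself, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The function name levenshtein is mine, not an Elasticsearch API, and the sketch counts only insertions, deletions and substitutions. ES by default also treats a swap of two adjacent characters as a single edit, which makes no difference for these particular strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


# Distances from the query "piet" to the two indexed usernames
for username in ("pietje", "p13tje"):
    print(username, levenshtein("piet", username))
```

The same function gives a distance of 1 between pietj and pietje, which is consistent with pietj finding pietje but not p13tje.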

The maximum allowed Levenshtein Edit Distance for ES fuzzy searching is 2 (larger distances are far more expensive to compute and are not supported by the Lucene search engine that ES is built on), so to loosen the search as far as it will go, set fuzziness to 2 manually.

"match": {
  "username": {
    "query": "piet",
    "fuzziness": "2"
  }
}

That will allow pietje to show up when you search for piet. p13tje is four edits away from piet, so it will still not match, because that is beyond the maximum fuzziness of 2.

Instead of manually setting it to 2, you could also set the low and high distance arguments for AUTO (format is AUTO:[low],[high] e.g. AUTO:15,30). AUTO never allows more than 2 edits either, so it cannot be fuzzier than a fixed value of 2.

For example, with a low of 8 and a high of 20:

Query terms shorter than 8 characters get no fuzzy matching and must match exactly
Query terms of 8 to 19 characters are allowed 1 edit
Query terms of 20 characters or more are allowed 2 edits
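
To make the thresholds concrete, here is a small sketch of how AUTO maps a query term's length to an allowed edit distance. It follows the documented defaults (AUTO:3,6, meaning lengths 0..2 must match exactly, 3..5 allow one edit, anything longer allows two). The function auto_fuzziness is a hypothetical helper written for illustration, not Elasticsearch's own code.

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Edits allowed for a query term under AUTO:[low],[high] (illustrative only)."""
    length = len(term)
    if length < low:
        return 0  # too short: must match exactly
    if length < high:
        return 1  # medium length: one edit allowed
    return 2      # long enough: two edits allowed, the maximum


# Default thresholds (AUTO:3,6) applied to queries of different lengths
for term in ("pi", "piet", "pietje"):
    print(term, auto_fuzziness(term))
```

With the defaults, piet falls in the one-edit band, which is why it cannot reach pietje.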

You can try tweaking the low and high values if you'd like, but for the... fuzziest fuzziness, set the edit distance to the maximum allowed Levenshtein edit distance (2).