Elasticsearch term suggester is not returning correct suggestions when one character is missing (ins

I'm using the Elasticsearch term suggester for spelling correction. My index contains a huge list of ads, and each ad has subject and body fields. I've found a problematic example for which the suggester does not return the correct suggestion.

I have lots of ads whose subject contains the word "soffa" and only 5 ads whose subject contains the word "sofa". Ideally, when I send "sofa" (the wrong spelling) as the text to the suggester, it should return "soffa" (the correct spelling), since most ads contain "soffa" and only a few contain "sofa".

Here is my suggester query body:

{
  "suggest": {
    "text": "sofa",
    "subjectSuggester": {
      "term": {
        "field": "subject",
        "suggest_mode": "popular",
        "min_word_length": 1
      }
    }
  }
}

When I send the above query, I get the following response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sof",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "soff",
                        "score": 0.6666666,
                        "freq": 298
                    },
                    {
                        "text": "sol",
                        "score": 0.6666666,
                        "freq": 101
                    },
                    {
                        "text": "saf",
                        "score": 0.6666666,
                        "freq": 6
                    }
                ]
            }
        ]
    }
}

As you can see in the response above, it returned "soff" but not "soffa", although I have lots of docs whose subject contains "soffa".
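A detail in the response above may be worth checking: the entry text is "sof" although the request text was "sofa". The term suggester analyzes the suggest text with the field's search analyzer and can only propose terms that are actually stored in the index for that field. If the subject analyzer rewrites tokens (for example with a stemmer; the mapping is not shown in this thread, so this is an assumption), "soffa" may simply not exist as an indexed term. The `_analyze` API with "field": "subject" would show what actually gets indexed. The Python sketch below, with a hypothetical helper name, flags entries whose text differs from what was sent:

```python
def rewritten_tokens(request_text, suggest_entries):
    """Return entry texts that do not literally appear among the request's words.

    A non-empty result hints that the field's analyzer changed the token
    (stemming, folding, etc.) before the suggester looked it up.
    """
    request_words = request_text.lower().split()
    return [e["text"] for e in suggest_entries if e["text"] not in request_words]


# Entry copied from the response above (options trimmed for brevity).
suggest_entries = [
    {
        "text": "sof",
        "offset": 0,
        "length": 4,
        "options": [{"text": "soff", "score": 0.6666666, "freq": 298}],
    }
]

print(rewritten_tokens("sofa", suggest_entries))  # ['sof']
```

If the list is non-empty, comparing the field's analyzer against the terms you expect to be suggested is a reasonable next step.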

I also experimented with parameters such as suggest_mode and string_distance, but with no luck.

I also tried the phrase suggester instead of the term suggester, with the same result. Here is my phrase suggester query:

{
    "suggest": {
        "text": "sofa",
        "subjectSuggester": {
            "phrase": {
                "field": "subject",
                "size": 10,
                "gram_size": 3,
                "direct_generator": [
                    {
                        "field": "subject.trigram",
                        "suggest_mode": "always",
                        "min_word_length": 1
                    }
                ]
            }
        }
    }
}

I suspect it doesn't work when a character is missing rather than misplaced. In the "soffa" example, one "f" is missing. It works fine for other misspellings: when I send "vovlo", it gives me "volvo".
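One way to test this hypothesis is to count edits by hand. Under a Damerau-Levenshtein style distance, a missing character and a swapped adjacent pair each cost a single edit, and the term suggester's default max_edits is 2, so the missing "f" alone should not rule out "soffa". The Python sketch below is an independent reimplementation (optimal string alignment), not Elasticsearch's own code. The `similarity` formula is an assumption about how the default distance is scored; it happens to be consistent with the 0.6666666 scores in the response above.

```python
def osa_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap as 1 each."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "vl" <-> "lv"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


def similarity(a, b):
    """Assumed scoring: 1 - distance / length of the shorter string."""
    return 1.0 - osa_distance(a, b) / min(len(a), len(b))


print(osa_distance("sofa", "soffa"))         # 1 (one missing character)
print(osa_distance("vovlo", "volvo"))        # 1 (one adjacent swap)
print(round(similarity("sof", "soff"), 4))   # 0.6667
```

Both misspellings are one edit away from their targets, which suggests the difference in behavior comes from somewhere other than the kind of typo.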

Any help would be hugely appreciated.

CodePudding user response：

Try changing "string_distance", for example to "ngram":

{
  "suggest": {
    "text": "sof",
    "subjectSuggester": {
      "term": {
        "field": "title",
        "min_word_length": 2,
        "string_distance":"ngram"
      }
    }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#term-suggester

CodePudding user response：

I found a workaround myself. I added a shingle filter (min_shingle_size 2, max_shingle_size 3, i.e. up to trigrams) and a custom analyzer named trigram that uses it, then added a subfield with that analyzer and ran the suggester query on that subfield instead of the actual field, and it worked.

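Here are the mapping changes (diacritical_marks_filter is a char filter that must already be defined in the index settings; its definition is not shown here):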

{
    "settings": {
        "analysis": {
            "filter": {
                "shingle": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3
                }
            },
            "analyzer": {
                "trigram": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "shingle"
                    ],
                    "char_filter": [
                        "diacritical_marks_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "subject": {
                "type": "text",
                "fields": {
                    "trigram": {
                        "type": "text",
                        "analyzer": "trigram"
                    }
                }
            }
        }
    }
}
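To make it easier to see what this analyzer indexes, here is a rough Python simulation of the lowercase + shingle chain. It is only a sketch: whitespace splitting stands in for the standard tokenizer, the char filter is ignored, and the function name is hypothetical. It relies on the shingle filter's default output_unigrams of true, which means single words are kept alongside the two- and three-word shingles. Since the chain contains no stemmer, a word like "soffa" is indexed verbatim, which may be why the suggester can find it on this subfield.

```python
def trigram_analyzer_tokens(text, min_size=2, max_size=3):
    """Approximate the tokens emitted by lowercase + shingle (unigrams included)."""
    words = text.lower().split()  # stand-in for the standard tokenizer
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)  # unigram, kept because output_unigrams defaults to true
        for size in range(min_size, max_size + 1):
            if i + size <= len(words):
                tokens.append(" ".join(words[i:i + size]))
    return tokens


print(trigram_analyzer_tokens("Soffa till salu"))
# ['soffa', 'soffa till', 'soffa till salu', 'till', 'till salu', 'salu']
```

The multi-word shingles mainly matter for the phrase suggester; for the term suggester the unstemmed unigrams are the relevant part.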

And here is my corrected query:

{
  "suggest": {
    "text": "sofa",
    "subjectSuggester": {
      "term": {
        "field": "subject.trigram",
        "suggest_mode": "popular",
        "min_word_length": 1,
        "string_distance": "ngram"
      }
    }
  }
}

Note that I'm running the suggester against subject.trigram instead of subject itself.

Here is the result:

{
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sofa",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "soffa",
                        "score": 0.8,
                        "freq": 282
                    },
                    {
                        "text": "soffan",
                        "score": 0.6666666,
                        "freq": 5
                    },
                    {
                        "text": "som",
                        "score": 0.625,
                        "freq": 102
                    },
                    {
                        "text": "sol",
                        "score": 0.625,
                        "freq": 82
                    },
                    {
                        "text": "sony",
                        "score": 0.625,
                        "freq": 50
                    }
                ]
            }
        ]
    }
}

As you can see above, "soffa" appears as the first suggestion.
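For anyone consuming this from application code, here is a minimal Python sketch that builds the same request body and picks the best option per input token. The helper names are hypothetical and no Elasticsearch client is called; the sample response is the one shown above, trimmed to three options.

```python
import json


def build_term_suggest_body(text, field, name="subjectSuggester"):
    """Build the term suggester request body used in this answer."""
    return {
        "suggest": {
            "text": text,
            name: {
                "term": {
                    "field": field,
                    "suggest_mode": "popular",
                    "min_word_length": 1,
                    "string_distance": "ngram",
                }
            },
        }
    }


def best_options(response, name="subjectSuggester"):
    """Map each input token to its highest-scoring option (ties broken by freq)."""
    best = {}
    for entry in response["suggest"][name]:
        if entry["options"]:
            top = max(entry["options"], key=lambda o: (o["score"], o["freq"]))
            best[entry["text"]] = top["text"]
    return best


sample_response = {
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sofa",
                "offset": 0,
                "length": 4,
                "options": [
                    {"text": "soffa", "score": 0.8, "freq": 282},
                    {"text": "soffan", "score": 0.6666666, "freq": 5},
                    {"text": "som", "score": 0.625, "freq": 102},
                ],
            }
        ]
    }
}

print(json.dumps(build_term_suggest_body("sofa", "subject.trigram"), indent=2))
print(best_options(sample_response))  # {'sofa': 'soffa'}
```

Tokens with no options are left out of the result, so the caller can fall back to the original word for them.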