Elasticsearch fuzziness with multi_match and bool_prefix type

Time:12-09

I have a set of search_as_you_type fields that I need to search against. Here is my mapping:

"mappings" : {
      "properties" : {
        "description" : {
          "type" : "search_as_you_type",
          "doc_values" : false,
          "max_shingle_size" : 3
        },
        "questions" : {
          "properties" : {
            "content" : {
              "type" : "search_as_you_type",
              "doc_values" : false,
              "max_shingle_size" : 3
            },
            "tags" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },
      
        "title" : {
          "type" : "search_as_you_type",
          "doc_values" : false,
          "max_shingle_size" : 3
        }
      }
    }

I am using a multi_match query with the bool_prefix type:

"query": {
    "multi_match": {
      "query": "triangle", 
      "type": "bool_prefix",
       "fields": [
           "title",
           "title._2gram",
           "title._3gram",
           "description",
           "description._2gram",
           "description._3gram",
           "questions.content",
           "questions.content._2gram",
           "questions.content._3gram",
           "questions.tags",
           "questions.tags._2gram",
           "questions.tags._3gram"
       ]
    }
  }

So far this works fine. Now I want to add typo tolerance, which in ES is called fuzziness. However, bool_prefix seems to conflict with it: if I add "fuzziness": "AUTO" to the query and misspell "triangle" as "triangld", I get no results.
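For reference, the modified request described above can be sketched in Python as follows. This is only an illustration: the helper name with_fuzziness is hypothetical, and the field list is shortened to the title fields for brevity.

```python
import copy


def with_fuzziness(search_body: dict, fuzziness: str = "AUTO") -> dict:
    """Return a copy of a multi_match search body with "fuzziness" set.

    Hypothetical helper for illustration; the original body is not mutated.
    """
    body = copy.deepcopy(search_body)
    body["query"]["multi_match"]["fuzziness"] = fuzziness
    return body


# The single-word query with a typo ("triangle" -> "triangld").
base = {
    "query": {
        "multi_match": {
            "query": "triangld",
            "type": "bool_prefix",
            # Shortened field list (assumption: title fields only).
            "fields": ["title", "title._2gram", "title._3gram"],
        }
    }
}

fuzzy = with_fuzziness(base)
```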

However, if I search for the phrase "right triangle", the behavior is different:

  1. Even with no typos, I get more results just by adding "fuzziness": "AUTO" (1759 vs 1267).
  2. If I add a typo to the 2nd word ("right triangdd"), it seems to work, but results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy", etc.) are now ranked first.
  3. If I make a typo in the 1st word ("righd triangle") or in both words ("righd triangdd"), the results seem to be fine. So this is probably the only correct behavior.

I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but it looks like bool_prefix is the only one that supports search-as-you-type, and I need to return search results as soon as the user starts typing.

Since all requests to ES are made from our backend, I can also manipulate the query string and build different search query types if needed. For example, I could use one type for single-word searches and another for multi-word searches. But I basically need to maintain the current behavior.

I've also tried appending "~" or "~1[2]" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems to be worse.

My questions are:

  1. How can I achieve fuzziness for single-word searches, so that the query "triangld" returns documents containing "triangle", etc.?
  2. How can I get correct search results when the typo is in the 2nd (last?) word of the query? As mentioned above, it works, but see point 2 above.
  3. Why does just adding fuzziness (see point 1 above) return more results even when the phrase is spelled correctly?
  4. Is there anything I need to change in my analyzers, etc.?

CodePudding user response:

To achieve the desired behavior, we did the following:

  1. Changed the query type to "query_string".
  2. Added query string preprocessing on the backend. We split the query string on whitespace and append "~1" or "~2" to each word if it is longer than 4 or 8 characters, respectively (~ is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user types [t, tr, tri, ... triangle] there is no fuzziness, but once the input is "triangle " it becomes "triangle~2". This is because applying fuzziness to the last word gives unexpected results.
  3. Removed all ngram fields from the search fields, as we get the same results but performance is a bit better.
  4. Added "default_operator": "AND" to the query to constrain phrase-query results to a single field.
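The preprocessing steps above can be sketched in Python roughly as follows. This is a minimal sketch, not our exact backend code: the names add_fuzziness, build_query, FUZZY_1_MIN_LEN and FUZZY_2_MIN_LEN are hypothetical, and the "~2" threshold is assumed to be 8 characters so that "triangle " becomes "triangle~2" as in the example. Escaping of query_string reserved characters is not handled here.

```python
# Assumed thresholds: "~1" for words longer than 4 characters, "~2" for
# words of 8 or more (chosen so that "triangle" gets "~2").
FUZZY_1_MIN_LEN = 5
FUZZY_2_MIN_LEN = 8


def add_fuzziness(raw_query: str) -> str:
    """Append "~1"/"~2" to completed words; leave the word being typed as-is."""
    words = raw_query.split()
    if not words:
        return ""
    # The last word counts as complete only once the user has typed whitespace.
    last_is_complete = raw_query[-1].isspace()
    result = []
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        if is_last and not last_is_complete:
            result.append(word)  # still being typed: no fuzziness
        elif len(word) >= FUZZY_2_MIN_LEN:
            result.append(word + "~2")
        elif len(word) >= FUZZY_1_MIN_LEN:
            result.append(word + "~1")
        else:
            result.append(word)  # short words stay exact
    return " ".join(result)


def build_query(raw_query: str, fields: list) -> dict:
    """Build a query_string search body over the given (non-ngram) fields."""
    return {
        "query": {
            "query_string": {
                "query": add_fuzziness(raw_query),
                "fields": list(fields),
                # AND requires every term of the query to match.
                "default_operator": "AND",
            }
        }
    }


body = build_query(
    "right triangle ",
    ["title", "description", "questions.content", "questions.tags"],
)
```

With these assumed thresholds, "right triangle" (still typing the second word) is sent as "right~1 triangle", and "right triangle " (after a trailing space) as "right~1 triangle~2". Callers may want to skip the search entirely when the preprocessed string is empty.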