Which analyzer is used while using fuzzy operator with query_string clause?


Suppose I have a query clause like this:

{
  "query": {
    "query_string": {
      "query": "ads spark~",
      "fields": [
        "flowName",
        "projectName"
      ],
      "default_operator": "and"
    }
  }
}

For this the explain output is:

"explanation": " (projectName:ads | flowName:ads)  (projectName:spark~1 | flowName:spark~1)"

Whereas if I remove the fuzzy operator from the query (updated query clause below),

{
  "query": {
    "query_string": {
      "query": "ads spark",
      "fields": [
        "flowName",
        "projectName"
      ],
      "default_operator": "and"
    }
  }
}

I get a different explain output,

"explanation": "(projectName:ads spark | flowName:ads spark)"

Any idea why the tokens generated are different in the two cases?

CodePudding user response:

When you use fuzzy queries, the way the query is parsed and constructed in Lucene differs from the normal behavior. What you see in the explanation is the Lucene query built from the query text. When fuzziness is used, most of the text analysis is skipped: only filters that work on a per-character basis are applied, as you can read in the documentation [1][2].

In the first case, since you are using fuzziness, the query text is split on whitespace. Then, for each term, a mandatory clause is built (the AND operator states that each term MUST appear in the document). You can call this a "term-centric" query. Each term is then searched across the multiple input fields with a disjunction (|) clause. You therefore see: "ads MUST be in projectName OR flowName, AND spark (with variations within the Levenshtein distance) MUST be in projectName OR flowName".
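To illustrate which variations `spark~1` can accept, here is a minimal Python sketch of the Levenshtein edit distance. (Lucene's fuzzy matching is implemented with an optimized automaton rather than this dynamic program, and the example terms below are hypothetical, but the accepted distance is the same idea.)

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# spark~1 accepts terms within edit distance 1 of "spark":
print(levenshtein("spark", "spork"))  # 1 -> would match
print(levenshtein("spark", "spunk"))  # 2 -> would not match
```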

In the second case, no fuzziness is used. Here the query text is passed to each field, and the terms then go through the corresponding field's text analysis (if any). You can call this a "field-centric" query. You therefore see "ads spark MUST be in projectName OR flowName" for a document to match. You are effectively moving from "I want all the terms to appear in the document" (possibly in different fields) to "I want all the terms to appear in a single field".
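The difference between the two semantics can be sketched in Python. This is a deliberately simplified model (hypothetical document, naive whitespace tokenization, no analysis or scoring), not how Lucene evaluates queries internally:

```python
def term_centric_match(doc: dict, terms: list, fields: list) -> bool:
    # AND across terms: each term must appear in at least one of the fields.
    return all(any(t in doc.get(f, "").split() for f in fields) for t in terms)

def field_centric_match(doc: dict, terms: list, fields: list) -> bool:
    # OR across fields: some single field must contain all of the terms.
    return any(all(t in doc.get(f, "").split() for t in terms) for f in fields)

doc = {"projectName": "ads", "flowName": "spark pipeline"}
terms, fields = ["ads", "spark"], ["projectName", "flowName"]

print(term_centric_match(doc, terms, fields))   # True: each term found in some field
print(field_centric_match(doc, terms, fields))  # False: no single field holds both
```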

If you want an in-depth analysis, you can read this blog post: https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html. It refers to Solr, but Elasticsearch applies the same behavior.
