Elasticsearch's minimumShouldMatch for each member of an array

Consider an Elasticsearch entity:

{
   "id": 123456,
   "keywords": ["apples", "bananas"]
}

Now, imagine I would like to find this entity by searching for apple.

{
  "match" : {
    "keywords" : {
      "query" : "apple",
      "operator" : "AND",
      "minimum_should_match" : "75%"
    }
  }
}

The problem is that the 75% minimum for matching is applied across both strings of the array together, so nothing will be found. Is there a way to say something like minimumShouldMatch: "75% of any of the array fields"?

Note that I need to use AND, as each item of keywords may consist of longer text.

EDIT:

I tried the proposed solutions, but none of them gave the expected results. I guess the problem is that the text can be quite long, e.g.:

["national gallery in prague", "narodni galerie v praze"]

I guess fuzzy expansion simply cannot expand such long strings when you only start searching with "national g".

Would this maybe be possible somehow via nested objects?

{ "keywords": [{ "keyword": "apples" }, { "keyword": "bananas" }] }

and then have minimumShouldMatch=1 on keywords and then 75% on each keyword?
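
Roughly, I imagine a mapping and query shaped like this (just a sketch; the index name and field names are illustrative):

PUT my-index
{
  "mappings": {
    "properties": {
      "keywords": {
        "type": "nested",
        "properties": {
          "keyword": { "type": "text" }
        }
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "nested": {
      "path": "keywords",
      "query": {
        "match": {
          "keywords.keyword": {
            "query": "national gallery",
            "minimum_should_match": "75%"
          }
        }
      }
    }
  }
}

Since a nested query matches as soon as at least one inner object matches, the minimum_should_match would then be evaluated per array element rather than across the whole array.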

CodePudding user response:

As per the docs:

The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator parameter can be set to "or" or "and" to control the boolean clauses (defaults to "or"). The minimum number of optional should clauses to match can be set using the minimum_should_match parameter.

If you are searching for multiple tokens, for example "apples mangoes", and set minimum_should_match to 100%, both tokens must be present in the document. If you set it to 50%, at least one of them must be present.
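
For example, a sketch (the index name is a placeholder); with two tokens and minimum_should_match at 50%, at least one token must match:

GET index-name/_search
{
  "query": {
    "match": {
      "keywords": {
        "query": "apples mangoes",
        "minimum_should_match": "50%"
      }
    }
  }
}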

If you want to match tokens partially, you can use the fuzziness parameter. It sets the maximum edit distance allowed for a match:

{
  "query": {
    "match": {
      "keywords": {
        "query": "apple",
        "fuzziness": "auto"
      }
    }
  }
}

If you are trying to match a word to its root form, you can use a stemming token filter ("stemmer"):

PUT index-name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords":{
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Tokens generated:

GET index-name/_analyze
{ 
   "text":  ["apples", "bananas"],
   "analyzer": "my_analyzer"
}

"tokens" : [
    {
      "token" : "appl",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "banana",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 101
    }
  ]

Stemming reduces words to their root form.
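
Since a match query analyzes the query text with the same analyzer as the field, a search for "apples" is also reduced to "appl" and therefore matches. A sketch against the index above:

GET index-name/_search
{
  "query": {
    "match": {
      "keywords": "apples"
    }
  }
}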

You can also explore n-grams and edge n-grams for partial matching.
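
For the prefix-style input from the question ("national g"), an edge n-gram analyzer is worth a look. A minimal sketch; the min_gram/max_gram values are assumptions and would need tuning, and the standard search_analyzer keeps the query text itself from being n-grammed:

PUT ngram-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_edge_ngram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}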
