Elasticsearch best similarity for retrieving exact matches

I have an index with 1 million phrases, and I want to search it with query phrases in Italian (that part is not the problem). The problem is the order in which the matches are returned: I want the exact matches first, so I changed the default similarity to "boolean". I thought that was a good idea, but sometimes it does not work. For example, when I search my index for phrases containing the words "film cortometraggio", the first matches are:

Distribuito dalla General Film Company, il film- un cortometraggio in due bobine
Distribuito dalla General Film Company, il film - un cortometraggio di 150 metri - uscì nelle sale cinematografiche

But there are better phrases that should be returned before those, such as:

Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;

In my opinion this last phrase should be returned first, because the two words I am searching for are adjacent, with no other words between them.

Using the BM25 algorithm, the first match I get is "Pappi Corsicato Ha diretto film, cortometraggi, documentari e videoclip.". In this case, too, the phrase "Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;" should come first because it is an exact match, and I don't understand why the algorithm gives the other phrase a higher score.

I am using the Java High Level REST Client, and my search queries are simple match phrase queries, like this: searchSourceBuilder.query(QueryBuilders.matchPhraseQuery(field, text).slop(5));

This is the structure of the documents in my index:

XContentBuilder builder = XContentFactory.jsonBuilder();
builder.startObject();
{
    builder.field("id", id);
    builder.field("frase", frase);
}
builder.endObject();
IndexRequest indexRequest = new IndexRequest(indice);
indexRequest.source(builder);
IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);

Does anyone know how I can change the similarity criteria to retrieve the matches in the correct order?

CodePudding user response：

I replicated your problem in my environment, with the same version and the same analyzers, and I got the same results. This is probably due to the BM25 algorithm: its scores depend on statistics from the whole index, so the other million documents influence the score.
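
To illustrate why the shorter phrase can win under BM25, here is a minimal Java sketch of a simplified BM25 formula. It is not Elasticsearch's actual implementation. The class name Bm25LengthDemo, the average field length of 12 tokens, and the phrase document frequency of 500 are assumptions made up for illustration, and the token count is a rough split on non-alphanumeric characters rather than the Italian analyzer's real output. It also assumes that both phrases match the query exactly once (after stemming, "cortometraggi" and "cortometraggio" plausibly reduce to the same token).

```java
class Bm25LengthDemo {
    // Commonly used BM25 defaults.
    static final double K1 = 1.2;
    static final double B = 0.75;

    // Inverse document frequency: rarer terms or phrases get a larger weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Frequency saturation with field-length normalization:
    // the longer the field relative to the average, the smaller the result.
    static double tfNorm(double freq, double fieldLen, double avgFieldLen) {
        return freq / (freq + K1 * (1 - B + B * fieldLen / avgFieldLen));
    }

    // Rough token count: split on anything that is not a letter or a digit.
    static int tokenCount(String text) {
        return text.trim().split("[^\\p{L}\\p{N}]+").length;
    }

    public static void main(String[] args) {
        String shortDoc = "Pappi Corsicato Ha diretto film, cortometraggi, documentari e videoclip.";
        // \u00e8 is the letter "e" with a grave accent, escaped to avoid encoding issues.
        String longDoc = "Robinet aviatore Robinet aviatore \u00e8 un film cortometraggio del 1911 diretto da Luigi Maggi;";

        double avgFieldLen = 12.0;  // hypothetical average field length in tokens
        long docCount = 1_000_000;  // index size mentioned in the question
        long docFreq = 500;         // hypothetical number of documents matching the phrase

        double weight = idf(docCount, docFreq);
        // Both documents are assumed to contain the phrase exactly once (freq = 1).
        double shortScore = weight * tfNorm(1.0, tokenCount(shortDoc), avgFieldLen);
        double longScore = weight * tfNorm(1.0, tokenCount(longDoc), avgFieldLen);

        System.out.printf("short phrase: %.4f%n", shortScore);
        System.out.printf("long phrase:  %.4f%n", longScore);
    }
}
```

With the same phrase frequency, the shorter field gets the larger normalized frequency, which is consistent with the shorter "Pappi Corsicato" phrase ranking above the longer "Robinet aviatore" one.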

I have some suggestions that could help you solve the problem:

Don't use the full stemming analyzers, because they are too aggressive; use the light version.
You could complement the light analyzer with the ngram tokenizer.
You could create a bool query that matches first against the field that does not use the Italian analyzer, using a multi-field.

Mapping example (my_italian_analyzer is a custom analyzer that must be defined in the index settings):

PUT italian_example
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_italian_analyzer"
          }
        }
      }
    }
  }
}

And the query could be something like:

GET italian_example/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "message": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "message.analyzed": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        }
      ]
    }
  }
}

CodePudding user response：

You can use a bool query like this:

{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "message": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "message": "film cortometraggio"
          }
        }
      ]
    }
  }
}

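In this query, a document that contains the exact phrase matches both should clauses, so its score is boosted above documents that match only the sloppy clause.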
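
Since the question uses Java, here is a minimal sketch that builds the same kind of bool query as a JSON string, with an explicit boost on the exact (slop 0) clause. The class name BoostedPhraseQuery, its helper methods, and the boost value 3.0 are hypothetical, and the escaping only covers quotes and backslashes. Assuming the high-level client is in use, the resulting string could be passed to QueryBuilders.wrapperQuery(String); the same structure can also be built with QueryBuilders.boolQuery() and two matchPhraseQuery clauses.

```java
import java.util.Locale;

class BoostedPhraseQuery {
    // Minimal JSON string escaping: backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One match_phrase clause with the given slop and boost.
    static String matchPhrase(String field, String text, int slop, double boost) {
        return String.format(Locale.ROOT,
            "{\"match_phrase\":{\"%s\":{\"query\":\"%s\",\"slop\":%d,\"boost\":%.1f}}}",
            escape(field), escape(text), slop, boost);
    }

    // Bool query with a sloppy clause plus an exact clause carrying a higher boost.
    static String build(String field, String text, int slop, double exactBoost) {
        return "{\"bool\":{\"should\":["
            + matchPhrase(field, text, slop, 1.0) + ","
            + matchPhrase(field, text, 0, exactBoost) + "]}}";
    }

    public static void main(String[] args) {
        // "frase" is the field name from the question; 3.0 is an arbitrary boost.
        System.out.println(build("frase", "film cortometraggio", 5, 3.0));
    }
}
```

The idea is that a document containing the adjacent phrase satisfies both clauses and collects both scores, while a document that only matches within the slop collects just the first one. The boost value would need tuning against real data.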