I have an index with 1 million phrases and I want to search it with query phrases in Italian (that part is not the problem). The problem is the order in which the matches are returned: I want exact matches first, so I changed the default similarity to "boolean". I thought that was a good idea, but sometimes it does not work. For example, when I search my index for phrases containing the words "film cortometraggio", the first matches are:
- Distribuito dalla General Film Company, il film- un cortometraggio in due bobine
- Distribuito dalla General Film Company, il film - un cortometraggio di 150 metri - uscì nelle sale cinematografiche
But there are better phrases that should be returned before those, such as:
- Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;
In my opinion this last phrase should be returned first, because the two words I am searching for are adjacent, with nothing between them.
With the BM25 algorithm, the first match I get is "Pappi Corsicato Ha diretto film, cortometraggi, documentari e videoclip.". Here too, the phrase "Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;" should come first because it is an exact match, and I don't understand why the algorithm gives the other phrase a higher score.
I am using the Java High Level REST Client, and my searches are simple match phrase queries, like this: searchSourceBuilder.query(QueryBuilders.matchPhraseQuery(field, text).slop(5));
This is the structure of the documents in my index:
// Build the document source with two fields: "id" and "frase".
XContentBuilder builder = XContentFactory.jsonBuilder();
builder.startObject();
{
    builder.field("id", id);
    builder.field("frase", frase);
}
builder.endObject();
// Index the document into the index named by `indice`.
IndexRequest indexRequest = new IndexRequest(indice);
indexRequest.source(builder);
IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
Does anyone know how I can change the similarity criteria to retrieve the matches in the correct order?
CodePudding user response:
I replicated your setup in my environment (same version, same analyzers) and got the same results. This is probably due to the BM25 algorithm: the statistics of all the other documents in the index influence the score.
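One way to see why "film, cortometraggi" can match a query for "film cortometraggio" is to look at the tokens the analyzer produces. A minimal sketch, assuming the field is analyzed with the built-in italian analyzer (the question does not show its mapping): send the following body to GET _analyze and compare the tokens returned for the singular and plural forms.

```json
{
  "analyzer": "italian",
  "text": "film cortometraggio film, cortometraggi"
}
```

If both forms come back as the same token, the stemmer treats them as the same term, and a phrase query cannot tell them apart on that field.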
I have some suggestions that could help you solve the problem:
- Don't use the full stemming analyzers because they are too aggressive; use the light version
- You could complement the light analyzer with the ngram tokenizer
- You could use a multi-field and a bool query, so that the query matches the field both without and with the language analyzer and matches on the unstemmed text score higher
Mapping example (my_italian_analyzer is a custom analyzer that must be defined in the index settings):
PUT italian_example
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_italian_analyzer"
          }
        }
      }
    }
  }
}
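The mapping refers to my_italian_analyzer without defining it. A minimal sketch of a create-index body (for PUT italian_example) that defines it with the light_italian stemmer is shown here; the filter name italian_light_stemmer and the choice of filters are assumptions made for illustration, not the configuration actually used in this answer.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_light_stemmer": {
          "type": "stemmer",
          "language": "light_italian"
        }
      },
      "analyzer": {
        "my_italian_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "italian_light_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_italian_analyzer"
          }
        }
      }
    }
  }
}
```

Settings and mappings go in the same request here because an analyzer cannot be added to an index that is already open.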
And the query could be something like:
GET italian_example/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "message": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "message.analyzed": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        }
      ]
    }
  }
}
CodePudding user response:
You can use a bool query like this:
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "message": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "message": "film cortometraggio"
          }
        }
      ]
    }
  }
}
In this query, documents that contain the exact phrase match both clauses, so they score higher than documents that match only with slop.
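To make the preference for the exact phrase explicit, a boost can also be set on the clause without slop. A sketch of the same query with that change; the value 2 is an arbitrary assumption and would need tuning against real results.

```json
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "message": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "message": {
              "query": "film cortometraggio",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}
```

If the field uses the boolean similarity, as in the question, each matching clause should contribute only its boost, so exact matches would be expected to rank above slop-only matches regardless of term statistics.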