How to build an Elasticsearch query that will take into account the distance between words?

I'm running with elasticsearch:7.6.2

I have an index with 4 simple documents:

    PUT demo_idx/_doc/1
    {
      "content": "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
    }

    PUT demo_idx/_doc/2
    {
      "content": "Distributed tmp nature, simple REST APIs, speed, and scalability"
    }

    PUT demo_idx/_doc/3
    {
      "content": "Distributed nature, simple REST APIs, speed, and scalability"
    }

    PUT demo_idx/_doc/4
    {
      "content": "Distributed tmp tmp nature"
    }

I want to search for the text "distributed nature" and get the results in the following order:

Doc id: 3 
Doc id: 1
Doc id: 2
Doc id: 4

That is, documents with an exact match (docs 3 and 1) should come first, followed by documents that match with a small slop (doc 2), and documents that match only with a large slop (doc 4) should come last.

I read the post "How to build an Elasticsearch query that will take into account the distance between words and the exactitude of the word", but it didn't help me.

I have tried the following search query:

"query": {
  "bool": {
    "must": [
      {
        "match_phrase": {
          "content": {
            "query": query,
            "slop": 2
          }
        }
      }
    ]
  }
}

But it didn't give me the required results.

I got the following results:

Doc id: 3, Score: 0.22949813
Doc id: 4, Score: 0.15556586
Doc id: 1, Score: 0.15401536
Doc id: 2, Score: 0.14397088

How can I write the query to get the results I want?

CodePudding user response:

You can rank the documents that match "Distributed nature" exactly above the others by using a bool query with two should clauses. The first clause, a match_phrase without slop, adds to the score of documents that contain the exact phrase. The second clause, with slop 2, also matches documents where the two words are further apart.

POST demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature",
              "slop": 2
            }
          }
        }
      ]
    }
  }
}

The search response will be as follows (exact matches now come first, but doc 4 still ranks above doc 2):

"hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.45899627,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.30803072,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.15556586,
        "_source" : {
          "content" : "Distributed tmp tmp nature"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.14397088,
        "_source" : {
          "content" : "Distributed tmp nature, simple REST APIs, speed, and scalability"
        }
      }
    ]
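Since the question passes the search text in a `query` variable, the same request body can be assembled in code. Here is a minimal sketch in Python, assuming the field name "content" and the slop of 2 used above. The helper name `build_phrase_query` is made up for illustration and is not part of any Elasticsearch client:

```python
import json


def build_phrase_query(text, field="content", slop=2):
    """Return a search body with an exact and a sloppy match_phrase clause.

    Hypothetical helper: it only builds a dict and does not talk to Elasticsearch.
    """
    exact = {"match_phrase": {field: {"query": text}}}
    sloppy = {"match_phrase": {field: {"query": text, "slop": slop}}}
    # A document that matches both clauses gets the sum of both scores,
    # so exact matches receive a contribution that slop-only matches lack.
    return {"query": {"bool": {"should": [exact, sloppy]}}}


body = build_phrase_query("Distributed nature")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the request body to demo_idx/_search with whatever HTTP or Elasticsearch client you already use.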

Update 1:

To remove the effect of field length on scoring, disable "norms" for the "content" field using the update mapping API:

PUT demo_idx/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "norms": false
    }
  }
}

Norms are not removed instantly; they disappear only as old segments are merged into new ones. So after this change, reindex the documents.

Now run the search query again. The search response will be in the order you expect.
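To see why field length was pushing doc 4 above doc 2, here is a rough Python sketch of the BM25 arithmetic behind the scores quoted above. It assumes the default parameters (k1 = 1.2, b = 0.75), a single shard, token counts as produced by the standard analyzer, and a sloppy phrase frequency of 1 / (1 + distance). The function and variable names are mine. Modelling disabled norms as "every document has the same length" is a simplification, so only the resulting order, not the exact values, should be read from the second run:

```python
import math

K1, B = 1.2, 0.75

# doc id -> (token count, number of words between "distributed" and "nature")
DOCS = {"1": (19, 0), "2": (9, 1), "3": (8, 0), "4": (4, 2)}
AVG_LEN = sum(length for length, _ in DOCS.values()) / len(DOCS)

# Both query terms occur in all 4 documents; a phrase's idf is the sum of its terms' idf.
N = n = len(DOCS)
IDF = 2 * math.log(1 + (N - n + 0.5) / (n + 0.5))


def clause_score(freq, length, use_norms=True):
    """BM25 score of one match_phrase clause for a given phrase frequency."""
    dl = length if use_norms else AVG_LEN  # assumption: no norms -> equal lengths
    norm = K1 * (1 - B + B * dl / AVG_LEN)
    return (K1 + 1) * IDF * freq / (freq + norm)


def score(doc_id, use_norms=True):
    """Score of the bool/should query: exact clause plus slop=2 clause."""
    length, distance = DOCS[doc_id]
    total = clause_score(1 / (1 + distance), length, use_norms)  # sloppy clause
    if distance == 0:
        total += clause_score(1.0, length, use_norms)  # exact clause also matches
    return total


for use_norms in (True, False):
    ranked = sorted(DOCS, key=lambda d: score(d, use_norms), reverse=True)
    print("norms on " if use_norms else "norms off", ranked)
```

With norms on, the sketch gives the order 3, 1, 4, 2 and the same scores as the response above. With equal lengths, docs 1 and 3 tie and doc 2 moves ahead of doc 4, since only the phrase frequency is left to tell them apart.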