Adding Elasticsearch sort returns incorrect results?

My queries are successfully returning the exact results that I am looking for.

{"size": 100,"from": 0, "query": {"bool": {"must": [{"bool":{"should":[{"match":{"ProcessId":"from-cn"}}]}}]}}}

This returns only items with ProcessId "from-cn". However, when I add a sort clause like this:

{"size": 100,"from": 0,"sort": [{"CreatedTimeStamp": {"order": "desc"}}], "query": {"bool": {"must": [{"bool":{"should":[{"match":{"ProcessId":"from-cn"}}]}}]}}}

This is now returning all "from-cn", but it is also returning several other results that do NOT have ProcessId "from-cn".

I know the sort is causing the issue, because when I remove it the results are correct again.

Why is this happening? How can I fix it?

CodePudding user response:

Try this query instead. What does it yield?

{
  "size": 100,
  "from": 0,
  "sort": [
    {
      "CreatedTimeStamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "ProcessId": "from-cn"
          }
        }
      ]
    }
  }
}

CodePudding user response:

The match query performs a full-text search.

This means it analyzes the provided text, producing tokens that are then matched against the document field.

Unless you defined a custom analyzer or search analyzer for the ProcessId field, Elasticsearch uses the standard analyzer here.

You can check which tokens it produces for the text "from-cn" with the Analyze API. In this case:

POST http://localhost:9200/_analyze
{
  "analyzer" : "standard",
  "text" : "from-cn"
}

The response:

{
  "tokens": [
    {
      "token": "from",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cn",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

You can see that it produces two tokens: "from" and "cn". Because the match query combines tokens with OR by default, documents containing only one of them also match the query. In your case, I believe, they simply fell outside the first 100 results that you requested, so you don't see them when searching without a custom sort.

When you don't use custom sorting, documents are sorted by score, so the documents most relevant to the query appear first. In your case, documents matching both tokens score higher than those matching only one. With custom sorting the order no longer depends on the score, so less relevant documents can appear higher in the list.
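
To make the token behavior concrete, here is the same match clause written in its long form with the default operator spelled out. This is only an illustration of what the original query already does, not a fix:

{
  "match": {
    "ProcessId": {
      "query": "from-cn",
      "operator": "or"
    }
  }
}

Changing "operator" to "and" would require both tokens to be present, which narrows the results. It is still not an exact match, though: a value that merely contains both "from" and "cn" somewhere would also qualify.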

Solution:

If you want to match the contents of the field exactly, define the field as non-analyzed in your mapping (e.g. using the keyword type instead of text) and use a query that doesn't analyze the provided text (e.g. a term query instead of match).
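
A minimal sketch of that approach, assuming a recent Elasticsearch version with typeless mappings and a hypothetical index named my-index (the index name is not from the question). The type of an existing field cannot be changed in place, so this normally means creating a new index with the mapping below and reindexing the data into it:

PUT http://localhost:9200/my-index
{
  "mappings": {
    "properties": {
      "ProcessId": {
        "type": "keyword"
      }
    }
  }
}

The sorted search can then use a term query inside a filter, since no scoring is needed when sorting by CreatedTimeStamp:

POST http://localhost:9200/my-index/_search
{
  "size": 100,
  "from": 0,
  "sort": [
    {
      "CreatedTimeStamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "ProcessId": "from-cn"
          }
        }
      ]
    }
  }
}

If the index was created through default dynamic mapping, the field may already have a keyword sub-field. In that case, running the term query against "ProcessId.keyword" might work without reindexing. Check your actual mapping first.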