Indexing/search algorithm stability between versions

I'm migrating from Elasticsearch 1.5 to 7.10. There are multiple required changes; the most relevant one is the removal of the document type concept in version 6. To deal with it, I introduced a new field, doc_type, and I match on it when I search. My question is: when I run the same search query (or an equivalent one, because some changes are needed), should I expect exactly the same result set? I'm seeing some differences, so I would like to figure out whether I broke something in the new mappings or in the search query. Thank you in advance.

Edit after first question:

In general: I have a service that communicates with ES 1.5 and I have to migrate it to ES 7.10 keeping the external API as stable as possible.

I'm not using scoring.
Previously I had document types A and B, and I queried them like this, for example: host/indexname/A,B/_search. After the migration I keep A or B in doc_type, and the query becomes host/indexname/_search with a "bool":{"should":[{"terms":{"doc_type":["A"],"boost":1.0}},{"terms":{"doc_type":["B"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0} in the body. If I put A and B in different indexes and the user wants to match both of them, I'll have to merge the search responses of both queries, and I don't know which strategy I should follow for that. Keeping everything together, I get a single response from ES with mixed doc_type results. I followed this specific approach: https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch#custom-type-field
The differences are not big. It's difficult to show a concrete example because the data/doc structure is complex, but the idea is this: for a given query, 1.5 returns, for example, [a, b, c, d, e, f, g, h, i, j] (where each hit may be of type A or B). With 7.10 I get responses like [a, b, e, c, d, f, g, h, i, j] or [a, b, c, d, e, g, i, j, k].

Second edit: this query was generated by the Java client.

{
   "from":0,
   "size":100,
   "query":{
      "bool":{
         "must":[
            {
               "query_string":{
                  "query":"mark_deleted:false",
                  "fields":[
                     
                  ],
                  "type":"best_fields",
                  "default_operator":"or",
                  "max_determinized_states":10000,
                  "enable_position_increments":true,
                  "fuzziness":"AUTO",
                  "fuzzy_prefix_length":0,
                  "fuzzy_max_expansions":50,
                  "phrase_slop":0,
                  "escape":false,
                  "auto_generate_synonyms_phrase_query":true,
                  "fuzzy_transpositions":true,
                  "boost":1.0
               }
            },
            {
               "bool":{
                  "should":[
                     {
                        "terms":{
                           "type":[
                              "A"
                           ],
                           "boost":1.0
                        }
                     },
                     {
                        "terms":{
                           "type":[
                              "B"
                           ],
                           "boost":1.0
                        }
                     },
                     {
                        "terms":{
                           "type":[
                              "D"
                           ],
                           "boost":1.0
                        }
                     }
                  ],
                  "adjust_pure_negative":true,
                  "boost":1.0
               }
            }
         ],
         "adjust_pure_negative":true,
         "boost":1.0
      }
   },
   "post_filter":{
      "term":{
         "mark_deleted":{
            "value":false,
            "boost":1.0
         }
      }
   },
   "sort":[
      {
         "a_specific_date":{
            "order":"desc"
         }
      }
   ],
   "highlight":{
      "pre_tags":[
         "<b>"
      ],
      "post_tags":[
         "</b>"
      ],
      "no_match_size":120,
      "fields":{
         "body":{
            "fragment_size":120,
            "number_of_fragments":1
         }
      }
   }
}

Answer:

First, since you don't care about scoring, you should use bool/filter instead of bool/must at the top level. Clauses in bool/must contribute to _score, and whenever no explicit sort applies, results are sorted by _score by default. Scoring has changed so much between 1.5 and 7.10 (the default similarity alone went from TF/IDF to BM25) that this would explain the differences you get. So you're better off simply sorting the results on a field other than _score.
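One related thing worth checking, as a hedged aside: when several documents share the same a_specific_date, their relative order is not guaranteed to be the same from one version (or one shard copy) to the next. A second sort key can make the order deterministic. The sketch below assumes a hypothetical keyword field, doc_key, that holds a unique value per document; that field is not in your mapping as posted, so substitute whatever unique field you actually have:

```json
{
  "sort": [
    { "a_specific_date": { "order": "desc" } },
    { "doc_key": { "order": "asc" } }
  ]
}
```

If the hits that swap places between 1.5 and 7.10 turn out to have identical a_specific_date values, a tie-breaker like this is probably all that is missing.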

Second, instead of the bool/should on type, you can use a single terms query, which does exactly the same job in a simpler way:

{
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "mark_deleted:false",
            "fields": [],
            "type": "best_fields",
            "default_operator": "or",
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": true,
            "boost": 1
          }
        },
        {
          "terms": {
            "type": [
              "A",
              "B",
              "D"
            ]
          }
        }
      ]
    }
  },
  "post_filter": {
    "term": {
      "mark_deleted": {
        "value": false,
        "boost": 1
      }
    }
  },
  "sort": [
    {
      "a_specific_date": {
        "order": "desc"
      }
    }
  ],
  "highlight": {
    "pre_tags": [
      "<b>"
    ],
    "post_tags": [
      "</b>"
    ],
    "no_match_size": 120,
    "fields": {
      "body": {
        "fragment_size": 120,
        "number_of_fragments": 1
      }
    }
  }
}

Finally, I'm not sure why you're using a query_string query to do an exact match on mark_deleted:false. Nothing in that condition needs to be parsed as query syntax, so a simple term query on mark_deleted is a better fit here.

It's also not clear why your post_filter applies mark_deleted:false again. It's the same condition as in your query_string constraint, so it doesn't remove anything the query hasn't already excluded.
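Putting the last two remarks together, the request could be reduced to something like the sketch below. It assumes mark_deleted is mapped as a boolean (or keyword) field, which the question doesn't show, and it keeps the field name type and the values A, B and D from your generated query:

```json
{
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term": { "mark_deleted": false } },
        { "terms": { "type": ["A", "B", "D"] } }
      ]
    }
  },
  "sort": [
    { "a_specific_date": { "order": "desc" } }
  ],
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "no_match_size": 120,
    "fields": {
      "body": { "fragment_size": 120, "number_of_fragments": 1 }
    }
  }
}
```

Everything now runs in filter context, so no scores are computed and the order depends only on the sort clause.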