aws opensearch: Why are similar sets of data ranked differently

Time:04-01

I have set up an AWS OpenSearch instance with pretty much everything set to default values. I then inserted some data about hotels. When the user searches for something like Good Morning B, my resulting query POST request looks like this:

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "good morning b*",
                        "fields": ["name"],
                        "default_operator": "and"
                    }
                },
                {
                    "match": {
                        "provider": "SomeProvider"
                    }
                }
            ]
        }
    },
    "sort": {
        "_score": {
            "order": "desc"
        },
        "name.keyword": {
            "order": "asc"
        }
    }
}

The result contains 4 entries with 2 different hotels. The names and all the other data in the index besides the ID are the same. Here is an excerpt of the response:

{
  "took": 442,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "1",
        "_score": 11.143229,
        "_source": {
          "id": "1",
          "name": "Good Morning   Berlin City East",
          "provider": "SomeProvider"
        },
        "sort": [
          11.143229,
          "Good Morning   Berlin City East"
        ]
      },
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "2",
        "_score": 10.455675,
        "_source": {
          "id": "2",
          "name": "Good Morning Bad Oldesloe",
          "provider": "SomeProvider"
        },
        "sort": [
          10.455675,
          "Good Morning Bad Oldesloe"
        ]
      },
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "3",
        "_score": 10.455675,
        "_source": {
          "id": "3",
          "name": "Good Morning Bad Oldesloe",
          "provider": "SomeProvider"
        },
        "sort": [
          10.455675,
          "Good Morning Bad Oldesloe"
        ]
      },
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "4",
        "_score": 9.6945305,
        "_source": {
          "id": "4",
          "name": "Good Morning   Berlin City East",
          "provider": "SomeProvider"
        },
        "sort": [
          9.6945305,
          "Good Morning   Berlin City East"
        ]
      }
    ]
  }
}

You can see that "Good Morning Berlin City East" gets two different scores for its two entries. As I said, the data is exactly the same. Since the name is the same, I would have expected both entries to be ranked equally, as is the case for the "Good Morning Bad Oldesloe" hotel.

I ran the same query with the explain=true parameter and got this for the Berlin entries (I only post the relevant part here to keep it compact):

// ID = 1
{
  "sort": [
    11.143229,
    "Good Morning   Berlin City East"
  ],
  "_explanation": {
    "value": 11.143229,
    "description": "sum of:",
    "details": [
      {
        "value": 9.302926,
        "description": "sum of:",
        "details": [
          {
            "value": 4.151463,
            "description": "weight(name:good in 1) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 4.151463,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                  {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                  },
                  {
                    "value": 4.811831,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                      {
                        "value": 11,
                        "description": "n, number of documents containing term",
                        "details": []
                      },
                      {
                        "value": 1413,
                        "description": "N, total number of documents with field",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.3921644,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                      {
                        "value": 1.0,
                        "description": "freq, occurrences of term within document",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                      },
                      {
                        "value": 5.0,
                        "description": "dl, length of field",
                        "details": []
                      },
                      {
                        "value": 3.6001415,
                        "description": "avgdl, average length of field",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 4.151463,
            "description": "weight(name:morning in 1) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 4.151463,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                  {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                  },
                  {
                    "value": 4.811831,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                      {
                        "value": 11,
                        "description": "n, number of documents containing term",
                        "details": []
                      },
                      {
                        "value": 1413,
                        "description": "N, total number of documents with field",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.3921644,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                      {
                        "value": 1.0,
                        "description": "freq, occurrences of term within document",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                      },
                      {
                        "value": 5.0,
                        "description": "dl, length of field",
                        "details": []
                      },
                      {
                        "value": 3.6001415,
                        "description": "avgdl, average length of field",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 1.0,
            "description": "name:b*",
            "details": []
          }
        ]
      },
      {
        "value": 1.840302,
        "description": "weight(provider:hob in 1) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 1.840302,
            "description": "score(freq=1.0), computed as boost * idf * tf from:",
            "details": [
              {
                "value": 2.2,
                "description": "boost",
                "details": []
              },
              {
                "value": 1.8403021,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 224,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 1413,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.45454544,
                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details": [
                  {
                    "value": 1.0,
                    "description": "freq, occurrences of term within document",
                    "details": []
                  },
                  {
                    "value": 1.2,
                    "description": "k1, term saturation parameter",
                    "details": []
                  },
                  {
                    "value": 0.75,
                    "description": "b, length normalization parameter",
                    "details": []
                  },
                  {
                    "value": 1.0,
                    "description": "dl, length of field",
                    "details": []
                  },
                  {
                    "value": 1.0,
                    "description": "avgdl, average length of field",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

// ID = 2
{
  "sort": [
      9.6945305,
      "Good Morning   Berlin City East"
  ],
  "_explanation": {
      "value": 9.6945305,
      "description": "sum of:",
      "details": [
          {
              "value": 7.975009,
              "description": "sum of:",
              "details": [
                  {
                      "value": 3.4875045,
                      "description": "weight(name:good in 380) [PerFieldSimilarity], result of:",
                      "details": [
                          {
                              "value": 3.4875045,
                              "description": "score(freq=1.0), computed as boost * idf * tf from:",
                              "details": [
                                  {
                                      "value": 2.2,
                                      "description": "boost",
                                      "details": []
                                  },
                                  {
                                      "value": 4.0562115,
                                      "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                      "details": [
                                          {
                                              "value": 24,
                                              "description": "n, number of documents containing term",
                                              "details": []
                                          },
                                          {
                                              "value": 1414,
                                              "description": "N, total number of documents with field",
                                              "details": []
                                          }
                                      ]
                                  },
                                  {
                                      "value": 0.39081526,
                                      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                      "details": [
                                          {
                                              "value": 1.0,
                                              "description": "freq, occurrences of term within document",
                                              "details": []
                                          },
                                          {
                                              "value": 1.2,
                                              "description": "k1, term saturation parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 0.75,
                                              "description": "b, length normalization parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 5.0,
                                              "description": "dl, length of field",
                                              "details": []
                                          },
                                          {
                                              "value": 3.5749645,
                                              "description": "avgdl, average length of field",
                                              "details": []
                                          }
                                      ]
                                  }
                              ]
                          }
                      ]
                  },
                  {
                      "value": 3.4875045,
                      "description": "weight(name:morning in 380) [PerFieldSimilarity], result of:",
                      "details": [
                          {
                              "value": 3.4875045,
                              "description": "score(freq=1.0), computed as boost * idf * tf from:",
                              "details": [
                                  {
                                      "value": 2.2,
                                      "description": "boost",
                                      "details": []
                                  },
                                  {
                                      "value": 4.0562115,
                                      "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                      "details": [
                                          {
                                              "value": 24,
                                              "description": "n, number of documents containing term",
                                              "details": []
                                          },
                                          {
                                              "value": 1414,
                                              "description": "N, total number of documents with field",
                                              "details": []
                                          }
                                      ]
                                  },
                                  {
                                      "value": 0.39081526,
                                      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                      "details": [
                                          {
                                              "value": 1.0,
                                              "description": "freq, occurrences of term within document",
                                              "details": []
                                          },
                                          {
                                              "value": 1.2,
                                              "description": "k1, term saturation parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 0.75,
                                              "description": "b, length normalization parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 5.0,
                                              "description": "dl, length of field",
                                              "details": []
                                          },
                                          {
                                              "value": 3.5749645,
                                              "description": "avgdl, average length of field",
                                              "details": []
                                          }
                                      ]
                                  }
                              ]
                          }
                      ]
                  },
                  {
                      "value": 1.0,
                      "description": "name:b*",
                      "details": []
                  }
              ]
          },
          {
              "value": 1.719521,
              "description": "weight(provider:hob in 380) [PerFieldSimilarity], result of:",
              "details": [
                  {
                      "value": 1.719521,
                      "description": "score(freq=1.0), computed as boost * idf * tf from:",
                      "details": [
                          {
                              "value": 2.2,
                              "description": "boost",
                              "details": []
                          },
                          {
                              "value": 1.719521,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                  {
                                      "value": 253,
                                      "description": "n, number of documents containing term",
                                      "details": []
                                  },
                                  {
                                      "value": 1414,
                                      "description": "N, total number of documents with field",
                                      "details": []
                                  }
                              ]
                          },
                          {
                              "value": 0.45454544,
                              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                              "details": [
                                  {
                                      "value": 1.0,
                                      "description": "freq, occurrences of term within document",
                                      "details": []
                                  },
                                  {
                                      "value": 1.2,
                                      "description": "k1, term saturation parameter",
                                      "details": []
                                  },
                                  {
                                      "value": 0.75,
                                      "description": "b, length normalization parameter",
                                      "details": []
                                  },
                                  {
                                      "value": 1.0,
                                      "description": "dl, length of field",
                                      "details": []
                                  },
                                  {
                                      "value": 1.0,
                                      "description": "avgdl, average length of field",
                                      "details": []
                                  }
                              ]
                          }
                      ]
                  }
              ]
          }
      ]
  }
}

The main difference, and the cause of the difference in rank, seems to be "n, number of documents containing term", which is 11 for the higher-ranked id = 1 and 24 for the lower-ranked id = 2. But since every data field is the same (besides the id), shouldn't it be the same number? The search term is the same for both entries.
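
To check how much these numbers move the score, the formula printed in the explain output can be evaluated by hand. The following is a minimal Python sketch that plugs in the values reported for the term "good" in both entries. The function names are made up for illustration and are not part of any OpenSearch API.

```python
import math

def bm25_idf(n, N):
    # idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf as printed in the explain output:
    # freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(n, N, dl, avgdl, freq=1.0, boost=2.2):
    # score = boost * idf * tf
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)

# Values copied from the two explain outputs for "name:good"
first = term_score(n=11, N=1413, dl=5.0, avgdl=3.6001415)
second = term_score(n=24, N=1414, dl=5.0, avgdl=3.5749645)

print(first, second)
```

The two results should land very close to the 4.151463 and 3.4875045 shown in the explain output. The tf part barely changes between the two entries, so almost all of the gap comes from the idf part, that is, from n and N.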

Can somebody explain to me (in simple terms, without much mathematics please) why there is a difference for this hotel but not for the one in Bad Oldesloe (where, as one would expect, the numbers in the explanation were the same)?

Thanks in advance

CodePudding user response:

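The document counts used for scoring are not computed across the whole index. They come from the underlying Lucene engine and are calculated per shard (each shard is a complete Lucene index). Your response reports 5 shards, and the two explain outputs show different totals (N = 1413 vs. 1414) as well as different term counts (n = 11 vs. 24), so the two Berlin documents are most likely on different shards and are scored against different statistics. The two Bad Oldesloe documents presumably ended up on the same shard, which is why their scores match.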
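
As a rough illustration of the per-shard effect, the Python sketch below computes the idf for the term "good" once per shard and once from pooled counts. It assumes, purely for the sake of the example, that the index consisted of only the two shards seen in the explain outputs (the real index has 5, and the counts of the other three are unknown). The shard labels are hypothetical. Pooling the counts is roughly what the `search_type=dfs_query_then_fetch` search parameter asks the cluster to do before scoring, at the cost of an extra round trip.

```python
import math

def bm25_idf(n, N):
    # idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Term statistics for "good" taken from the two explain outputs.
# Shard names are hypothetical labels, not real shard identifiers.
shards = {
    "shard_a": {"n": 11, "N": 1413},
    "shard_b": {"n": 24, "N": 1414},
}

# Default behaviour: every shard scores with its own local counts.
per_shard_idf = {name: bm25_idf(s["n"], s["N"]) for name, s in shards.items()}

# Pooled counts (assumption: only these two shards exist).
global_n = sum(s["n"] for s in shards.values())
global_N = sum(s["N"] for s in shards.values())
global_idf = bm25_idf(global_n, global_N)

print(per_shard_idf)
print(global_idf)
```

With pooled counts every document is scored against the same idf, so identical documents would get identical scores regardless of which shard holds them. On an index with more data the per-shard counts usually even out and the difference shrinks without any extra setting.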
