Hibernate Search / Lucene based Sorting Issue

Time:09-21

I am having an issue with sorting, described below.

Previously, the code was written as

Sort sort = new Sort(new SortField[] {
   SortField.FIELD_SCORE,
   new SortField("field_1", SortField.STRING),
   new SortField("field_2", SortField.STRING),
   new SortField("field_2", SortField.LONG)
});

This is an example taken from a Stack Overflow answer on custom sorting, "Sorting search result in Lucene based on a numeric field".

Although that answer does not claim this is the correct way to sort, it is also the code my company has used for years.

But when I wrote a new function that needs to sort on many fields and ran unit tests against it, I found that it does not actually work as intended.

I need to remove SortField.FIELD_SCORE to make it work correctly. If I understand correctly, this is what the example described here suggests: https://docs.jboss.org/hibernate/search/4.1/reference/en-US/html_single/#d0e5317.

i.e. the main code becomes

Sort sort = new Sort(new SortField[] {
   new SortField("field_1", SortField.STRING),
   new SortField("field_2", SortField.STRING),
   new SortField("field_2", SortField.LONG)
});

So my questions are:

  1. What is SortField.FIELD_SCORE used for? How is the score calculated?
  2. Why does including SortField.FIELD_SCORE sometimes return the correct order and sometimes not?

CodePudding user response:

What is SortField.FIELD_SCORE used for? How is the score calculated?

When you search for documents containing a word, each document gets assigned a "score": a float value, generally positive. The higher this value, the better the match. How exactly this is computed is a bit complex, and it gets worse when you have multiple nested queries (e.g. boolean queries, etc.), because then scores get combined with other formulas. Suffice it to say: the score is a number, there's one value for each document, and higher is better.

SortField.FIELD_SCORE will simply sort documents by descending score.

Why does including SortField.FIELD_SCORE sometimes return the correct order and sometimes not?

Hard to say. It depends on lots of things, like your analyzers, the exact query you're running, and even the frequency of the search terms in your documents. Like I said, the formula used to compute the score is complex.

One thing that stands out in your sort, though, is that you're sorting by score first and then by actual fields. That's unlikely to work well. Scores are generally unique, so unless your documents are very similar (e.g. all text fields are empty for some reason), the top documents will have scores like this: [5.1, 3.4, 2.6, 2.4, 2.2]. Their order is already "complete": you can add as many subsequent sorts as you want, and the order will not change, because it is fully defined by the sort by score. The later fields only come into play when two documents have exactly the same score.

Think of alphabetical order: if I have to sort ["area", "baby"], the second letter of "baby" may be "a", but it doesn't matter, because the first letter is "b" and it's always going to be after the "a" of "area".
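To make that concrete, here is a minimal plain-Java sketch of the same precedence rule. It does not use Lucene or Hibernate Search at all: the Hit class, its fields, and the sample values are hypothetical stand-ins for search results, and the comparators only mimic what a sort on { FIELD_SCORE, field_1 } would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: Hit is a hypothetical stand-in, not a Lucene or Hibernate Search type.
class SortPrecedenceDemo {

    static final class Hit {
        final String id;
        final float score;   // stands in for the relevance score
        final String field1; // stands in for "field_1"

        Hit(String id, float score, String field1) {
            this.id = id;
            this.score = score;
            this.field1 = field1;
        }
    }

    // Higher score first, like a sort by score.
    static final Comparator<Hit> BY_SCORE_DESC = (a, b) -> Float.compare(b.score, a.score);

    // Ascending string order on field1.
    static final Comparator<Hit> BY_FIELD1 = (a, b) -> a.field1.compareTo(b.field1);

    // Score first, field1 only as a tie-breaker: mimics { FIELD_SCORE, field_1 }.
    static final Comparator<Hit> SCORE_THEN_FIELD1 = BY_SCORE_DESC.thenComparing(BY_FIELD1);

    // Sample hits whose scores are all different.
    static List<Hit> uniqueScores() {
        return Arrays.asList(
            new Hit("A", 5.1f, "c"),
            new Hit("B", 3.4f, "a"),
            new Hit("C", 2.6f, "b"));
    }

    // Sample hits that all share the same score.
    static List<Hit> tiedScores() {
        return Arrays.asList(
            new Hit("A", 1.0f, "c"),
            new Hit("B", 1.0f, "a"),
            new Hit("C", 1.0f, "b"));
    }

    // Sorts a copy of the hits with the given order and returns the ids in that order.
    static List<String> sortedIds(List<Hit> hits, Comparator<Hit> order) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(order);
        List<String> ids = new ArrayList<>();
        for (Hit h : copy) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Unique scores: field1 never gets a say.
        System.out.println(sortedIds(uniqueScores(), SCORE_THEN_FIELD1)); // [A, B, C]
        // Field-only sort: the order you probably expected.
        System.out.println(sortedIds(uniqueScores(), BY_FIELD1));         // [B, C, A]
        // Identical scores: only now does field1 decide.
        System.out.println(sortedIds(tiedScores(), SCORE_THEN_FIELD1));   // [B, C, A]
    }
}
```

With unique scores, the combined sort gives exactly the same order as the score-only sort. That would also explain why a unit test can pass "sometimes": when the test documents happen to have identical scores, the field sorts take over and the result looks right.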

So, if you're not interested in a sort by score (and, if you don't know what score is, chances are you indeed are not interested), just stick to sorts by field:

Sort sort = new Sort(new SortField[] {
   new SortField("field_1", SortField.STRING),
   new SortField("field_2", SortField.STRING),
   new SortField("field_2", SortField.LONG)
});

And if you're interested in a sort by score, then just sort by score:

Sort sort = new Sort(new SortField[] {
   SortField.FIELD_SCORE
});

// Or equivalently
Sort sort = Sort.RELEVANCE; // "Relevance" means "sort by score"

Note that Hibernate Search 4.1 (the version for your documentation link) is very old; you should consider upgrading at least to 5.11 (similar API, also old but still maintained), and preferably to 6.0 (different, but more modern API, new and also maintained).
