Hibernate search with lucene does not index similar names correctly-CodePudding

I'm learning Hibernate Search 6.1.3.Final with Lucene 8.11.1 as backend and Spring Boot 2.6.6. I'm trying to create a search for product names, barcodes and manufacturers. Currently, I'm doing an integration test to see what happens when a couple of products have similar name:

    @Test
    void shouldFindSimilarTobaccosByQuery() {
        var tobaccoGreen = TobaccoBuilder.builder()
            .name("TobaCcO GreEN")
            .build();
        var tobaccoRed = TobaccoBuilder.builder()
            .name("TobaCcO ReD")
            .build();
        var tobaccoGreenhouse = TobaccoBuilder.builder()
            .name("TobaCcO GreENhouse")
            .build();
        tobaccoRepository.saveAll(List.of(tobaccoGreen, tobaccoRed, tobaccoGreenhouse));

        webTestClient
            .get().uri("/tobaccos?query=green")
            .exchange()
            .expectStatus().isOk()
            .expectBodyList(Tobacco.class)
            .value(tobaccos -> assertThat(tobaccos)
                .hasSize(2)
                .contains(tobaccoGreen, tobaccoGreenhouse)
            );
    }

As you can see in the test, I expect to obtain the two tobaccos with similar names: tobaccoGreen and tobaccoGreenhouse by using a green as query for the search criteria. The entity is the following:

@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {
    @Id
    @GeneratedValue
    private UUID id;
    @NotBlank
    @KeywordField
    private String barcode;
    @NotBlank
    @FullTextField(analyzer = "name")
    private String name;
    @NotBlank
    @FullTextField(analyzer = "name")
    private String manufacturer;
    @CreatedDate
    private Instant createdAt;
    @LastModifiedDate
    private Instant updatedAt;
}

I have followed the docs and configure an analyzer for names:

@Component("luceneTobaccoAnalysisConfigurer")
public class LuceneTobaccoAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer("name").custom()
            .tokenizer("standard")
            .tokenFilter("lowercase")
            .tokenFilter("asciiFolding");
    }
}

And using a simple query with fuzzy option:

@Component
@AllArgsConstructor
public class IndexSearchTobaccoRepository {

    private final EntityManager entityManager;

    public List<Tobacco> find(String query) {
        return Search.session(entityManager)
            .search(Tobacco.class)
            .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
            )
            .fetch(10)
            .hits();
    }
}

The test shows that is only able to find tobaccoGreen but not tobaccoGreenhouse and I don't understand why, how can I search similar product names (or barcodes, manufacturer)?

CodePudding user response：

Before I answer your question, I'd like to point out that calling .fetch(10).hits() is suboptimal, especially when using the default sort (like you do):

        return Search.session(entityManager)
            .search(Tobacco.class)
            .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
            )
            .fetch(10)
            .hits();

If you call .fetchHits(10) directly, Lucene will be able to skip part of the search (the part where it counts the total hit count), and in large indexes this could lead to sizeable performance gains. So, do this instead:

        return Search.session(entityManager)
            .search(Tobacco.class)
            .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
            )
            .fetchHits(10);

Now, the actual answer:

Approaching this through the search query

.fuzzy() isn't magic, it won't just match anything you think should match :) There's a specific definition of what it does, and that's not what you want here.

To get the behavior you want, you could use this instead of your current predicate:

            .where(f -> f.simpleQueryString()
                .fields("barcode", "name", "manufacturer")
                .matching("green*")
            )

You lose fuzziness, but you get the ability to perform prefix queries, which would give the results you want (green* would match greenhouse).

However, the prefix queries are explicit: the user must add * after "green" in order to match "all words that start with green".

Which leads us to...

Approaching this through analyzers

If you want this "prefix matching" behavior to be automatic, without the need to add * in the query, then what you need is a different analyzer.

Your current analyzer breaks down indexed text using space as a separator (more or less; it's a bit more complex but that's the idea). But you apparently want it to break down "greenhouse" into "green" and "house"; that's the only way a query with the word "green" would match the word "greenhouse".

To do that, you can use an analyzer similar to yours, but with an additional "edge_ngram" filter, to generate additional indexed tokens for every prefix string of your existing tokens.

Add another analyzer to your configurer:

@Component("luceneTobaccoAnalysisConfigurer")
public class LuceneTobaccoAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer("name").custom()
            .tokenizer("standard")
            .tokenFilter("lowercase")
            .tokenFilter("asciiFolding");

        // THIS PART IS NEW
        context.analyzer("name_prefix").custom()
            .tokenizer("standard")
            .tokenFilter("lowercase")
            .tokenFilter("asciiFolding")
            .tokenFilter("edgeNGram")
                    // Handling prefixes from 2 to 7 characters.
                    // Prefixes of 1 character or more than 7 will
                    // not be matched.
                    // You can extend the range, but this will take more
                    // space in the index for little gain.
                    .param( "minGramSize", "2" )
                    .param( "maxGramSize", "7" );
    }
}

And change your mapping to use the name analyzer when querying, but the name_prefix analyzer when indexing:

@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {
    @Id
    @GeneratedValue
    private UUID id;
    @NotBlank
    @KeywordField
    private String barcode;
    @NotBlank
    // CHANGE THIS
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
    private String name;
    @NotBlank
    // CHANGE THIS
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
    private String manufacturer;
    @CreatedDate
    private Instant createdAt;
    @LastModifiedDate
    private Instant updatedAt;
}

Now reindex your data.

Now your query "green" will also match "TobaCcO GreENhouse", because "GreENhouse" was indexed as ["greenhouse", "gr", "gre", "gree", "green", "greenh", "greenho"].

Variations

`edgeNGram` filter on distinct fields

Instead of changing the analyzer of your current fields, you could add new fields for the same Java properties, but using the new analyzer with the edgeNGram filter:

@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {
    @Id
    @GeneratedValue
    private UUID id;
    @NotBlank
    @KeywordField
    private String barcode;
    @NotBlank
    @FullTextField(analyzer = "name")
    // ADD THIS
    @FullTextField(name = "name_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
    private String name;
    @NotBlank
    @FullTextField(analyzer = "name")
    // ADD THIS
    @FullTextField(name = "manufacturer_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
    private String manufacturer;
    @CreatedDate
    private Instant createdAt;
    @LastModifiedDate
    private Instant updatedAt;
}

Then you can target these fields as well as the normal ones in your query:

@Component
@AllArgsConstructor
public class IndexSearchTobaccoRepository {

    private final EntityManager entityManager;

    public List<Tobacco> find(String query) {
        return Search.session(entityManager)
            .search(Tobacco.class)
            .where(f -> f.match()
                .fields("barcode", "name", "manufacturer").boost(2.0f)
                .fields("name_prefix", "manufacturer_prefix")
                .matching(query)
                .fuzzy()
            )
            .fetchHits(10);
    }
}

As you can see, I added a boost to the fields that don't use prefix. This is the main advantage of this variant over the one I explained higher up: matches on actual words (not prefixes) will be deemed more important, yielding a better score and thus pulling documents to the top of the result list if you use a relevance sort (which is the default sort).

Handling only compound words instead of all words

I won't detail it here, but there's another approach if all you want is to handle compound words ("greenhouse" => "green" "house", "superman" => "super" "man", etc.). You can use the "dictionaryCompoundWord" filter. This is less powerful, but will generate less noise in your index (fewer meaningless tokens) and thus could lead to better relevance sorts. Another downside is that you need to provide the filter with a dictionary that contains all words that could possibly be "compounded". For more information, see the source and javadoc of class org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilterFactory, or the documentation of the equivalent filter in Elasticsearch.

Approaching this through the search query

Approaching this through analyzers

Variations

edgeNGram filter on distinct fields

Handling only compound words instead of all words

`edgeNGram` filter on distinct fields