I want to implement Elasticsearch on a customized corpus. I have installed Elasticsearch version 7.5.1, and I do all my work in Python using the official client. I have a few questions:
- How do I customize the preprocessing pipeline? For example, I want to use a BertTokenizer to convert strings to tokens instead of n-grams.
- How do I customize the scoring function of each document with respect to the query? For example, I want to compare the effects of TF-IDF versus BM25, or even use neural models for scoring.
If there is a good tutorial in Python, please share it. Thanks in advance.
CodePudding user response:
You can customize the similarity function when creating an index; see the Similarity module section of the documentation. There is also a good article on the OpenSource Connections site that compares classical TF-IDF with BM25.
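As a minimal sketch with the official Python client (the cluster URL, index name, and field names here are placeholders; the scripted TF-IDF script is the example from the similarity-module documentation), you can define several similarities on one index and assign them per field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Two similarities on one index: a tuned BM25 and a scripted TF-IDF
# (classic TF-IDF was removed in 7.0, so it is recreated via a script).
body = {
    "settings": {
        "index": {
            "similarity": {
                "my_bm25": {
                    "type": "BM25",
                    "k1": 1.2,   # term-frequency saturation
                    "b": 0.75,   # length normalization
                },
                "scripted_tfidf": {
                    "type": "scripted",
                    "script": {
                        "source": (
                            "double tf = Math.sqrt(doc.freq); "
                            "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; "
                            "double norm = 1/Math.sqrt(doc.length); "
                            "return query.boost * tf * idf * norm;"
                        )
                    },
                },
            }
        }
    },
    "mappings": {
        "properties": {
            # Index the same text into both fields to compare the rankings.
            "body_bm25": {"type": "text", "similarity": "my_bm25"},
            "body_tfidf": {"type": "text", "similarity": "scripted_tfidf"},
        }
    },
}

es.indices.create(index="my-corpus", body=body)
```

Indexing the same text into both fields lets you run an identical query against each one and compare the two rankings directly.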
In general, Elasticsearch uses an inverted index to look up all documents that contain a specific word or token.
It sounds like you want to use vector fields for scoring; there is a good article on the Elastic blog that explains how you can achieve that. Be aware that, as of now, Elasticsearch uses vector fields only for scoring, not for retrieval. If you want vector-based retrieval, you have to use a plugin, the OpenSearch fork, or wait for version 8.
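On 7.5 that pattern looks roughly like the following (a sketch with placeholder index/field names and a 768-dimensional vector, matching BERT-base embeddings). The match query does the retrieval through the inverted index; the script only re-scores those hits:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Mapping with a dense_vector field for the document embeddings.
es.indices.create(
    index="vector-demo",
    body={
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": 768},
            }
        }
    },
)

query_vector = [0.1] * 768  # placeholder: the embedding of your query text

response = es.search(
    index="vector-demo",
    body={
        "query": {
            "script_score": {
                # Retrieval step: only documents matching this query are scored.
                "query": {"match": {"text": "example query"}},
                "script": {
                    # "+ 1.0" keeps scores non-negative (cosine is in [-1, 1]).
                    # On 7.5 the field is passed as doc['embedding'];
                    # 7.6+ takes the field name as a string instead.
                    "source": "cosineSimilarity(params.query_vector, doc['embedding']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    },
)
```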
In my opinion, using ANN in real time during search is too slow and expensive, and I have yet to see relevance improvements over normal search requests.
I would do the preprocessing of your documents in your own Python environment before indexing, rather than with Elasticsearch pipelines or plugins; it is easier to debug and iterate outside of Elasticsearch.
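For your BertTokenizer case, one sketch of that approach (assuming the Hugging Face transformers package; the index and field names are made up): tokenize the text yourself, join the tokens with spaces, and map the field with the whitespace analyzer so Elasticsearch stores your tokens unchanged:

```python
from elasticsearch import Elasticsearch, helpers
from transformers import BertTokenizer

es = Elasticsearch("http://localhost:9200")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The whitespace analyzer only splits on spaces, so the pre-tokenized
# BERT WordPieces are indexed exactly as produced.
es.indices.create(
    index="bert-tokens",
    body={
        "mappings": {
            "properties": {
                "raw": {"type": "text"},
                "bert_tokens": {"type": "text", "analyzer": "whitespace"},
            }
        }
    },
)

corpus = ["Elasticsearch is a search engine.", "BERT produces WordPiece tokens."]

def actions():
    for i, doc in enumerate(corpus):
        tokens = tokenizer.tokenize(doc)  # e.g. ['word', '##piece', ...]
        yield {
            "_index": "bert-tokens",
            "_id": i,
            "_source": {"raw": doc, "bert_tokens": " ".join(tokens)},
        }

helpers.bulk(es, actions())

# Tokenize queries the same way so query and index tokens line up.
query_tokens = " ".join(tokenizer.tokenize("search engine"))
es.search(index="bert-tokens", body={"query": {"match": {"bert_tokens": query_tokens}}})
```

The important detail is applying the same tokenizer to the query text, otherwise the query terms will not match the indexed WordPieces.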
You could also take a look at the Haystack project; it might already have a lot of the functionality you are looking for built in.