How to find the actual sentence from sentence transformer?

Time:07-26

I am trying to do semantic search with sentence transformer and faiss.

I am able to generate embeddings from the corpus and run a query with the query xq. But what are t

import faiss
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

def get_embeddings(code_snippets: str):
    return model.encode(code_snippets)

def build_vector_database(atlas_datapoints):
    dimension = 768  # dimensions of each vector

    corpus = ["tom loves candy",
                    "this is a test"
                    "hello world"
                    "jerry loves programming"]

    code_snippet_emddings = get_embeddings(corpus)
    print(code_snippet_emddings.shape)

    d = code_snippet_emddings.shape[1]
    index = faiss.IndexFlatL2(d)
    print(index.is_trained)

    index.add(code_snippet_emddings)
    print(index.ntotal)

    k = 2
    xq = model.encode(["jerry loves candy"])

    D, I = index.search(xq, k)  # search
    print(I)
    print(D)

This code returns

[[0 1]]
[[1.3480902 1.6274161]]

But I can't find which sentence xq matches; I only get the indices and the matching scores.

How can I find the top-N matching strings from the corpus?

CodePudding user response:

To retrieve the query results, try something like this using the variables from your code. I holds one row per query, so iterate over the first row:

[corpus[i] for i in I[0]]
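
A slightly fuller sketch of the same idea, pairing each hit with its distance. The I and D values below are hand-written stand-ins shaped like the output of index.search (one row per query), the corpus is a corrected four-sentence version, and top_matches is a hypothetical helper name, not part of faiss.

```python
import numpy as np

def top_matches(corpus, I, D):
    """Return, for each query row, a list of (sentence, distance) pairs."""
    results = []
    for ids_row, dist_row in zip(I, D):
        results.append(
            [(corpus[int(i)], float(d)) for i, d in zip(ids_row, dist_row)]
        )
    return results

# Stand-in data: a corrected 4-sentence corpus and made-up search output.
corpus = ["tom loves candy", "this is a test", "hello world", "jerry loves programming"]
I = np.array([[0, 3]])                          # assumed indices of the k=2 nearest sentences
D = np.array([[1.25, 1.5]], dtype=np.float32)   # assumed distances for those hits

matches = top_matches(corpus, I, D)
for sentence, distance in matches[0]:
    print(f"{distance:.2f}  {sentence}")
```

Converting each index with int() keeps this working for a plain Python list as well as for a NumPy array.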

But if you have corpus as a np.array object, you can do some cool slicing like this:

import numpy as np

# If your corpus is in array form.
corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])

# And indices can be a list of integers.
indices = [1,0]

# Results.
corpus[indices]

And it can get a little cooler if your indices are already an np.array, like the output of faiss. For example, with 2 queries the result has size 2xk:

import numpy as np

corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])

indices = np.array([[1,0], [0,2]])

corpus[indices]
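
One caveat with this kind of indexing: as far as I know, faiss fills a slot with -1 when it finds fewer than k neighbours, and NumPy reads -1 as "last element", so the wrong sentence would come back silently. A minimal sketch that masks those slots; the indices array and the empty-string placeholder are made up for illustration.

```python
import numpy as np

corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])

# Hypothetical search output for 2 queries with k=2; -1 marks a missing neighbour.
indices = np.array([[1, 0], [2, -1]])

valid = indices >= 0
# Look up with a safe index (0) where the slot is invalid, then blank those slots out.
sentences = np.where(valid, corpus[np.where(valid, indices, 0)], '')

print(sentences)
```

The result keeps the same 2xk layout as indices, with an empty string wherever no neighbour was returned.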

Additional Notes

The faiss.IndexFlatL2 object returns these through the search() function:

  • labels – output labels of the NNs, size n*k
    • i.e. I in your code snippet refers to indices of the top-K results
  • distances – output pairwise distances, size n*k
    • i.e. D in your code snippet refers to the distances of the top-K results from your query string.

Since you have only 1 query, n=1, so your I and D matrices are of size 1xk.
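
To make the shapes concrete, here is a NumPy stand-in for what an exhaustive L2 search computes. It is only a sketch, not faiss's implementation: brute_force_search is a hypothetical helper, and the random vectors merely play the role of sentence embeddings. It returns squared L2 distances, which is also what IndexFlatL2 reports.

```python
import numpy as np

def brute_force_search(xb, xq, k):
    """Exhaustive search: squared L2 distance from every query to every vector."""
    # diff has shape (n_queries, n_vectors, dim)
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=2)             # shape (n_queries, n_vectors)
    I = np.argsort(dist, axis=1)[:, :k]        # indices of the k nearest, closest first
    D = np.take_along_axis(dist, I, axis=1)    # their distances, same shape as I
    return D, I

rng = np.random.default_rng(0)
xb = rng.random((4, 8), dtype=np.float32)      # 4 stand-in "embeddings" of dimension 8
xq = xb[[2]] + 0.01                            # 1 query, built to sit next to vector 2

D, I = brute_force_search(xb, xq, k=2)
print(I.shape, D.shape)   # one row per query, k columns
print(I[0, 0])            # index of the closest vector
```

With a single query both arrays have one row, which is why the answer is read from I[0] and D[0].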

CodePudding user response:

You forgot some commas in your corpus, so Python joined the adjacent string literals and you only passed two sentences in the corpus.
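
A quick way to see the effect, using the corpus from the question and a corrected copy (the name fixed is only for illustration):

```python
# Adjacent string literals without a comma are concatenated by Python.
corpus = ["tom loves candy",
          "this is a test"
          "hello world"
          "jerry loves programming"]

print(len(corpus))   # 2
print(corpus[1])     # this is a testhello worldjerry loves programming

# With the commas restored, the list has the intended four sentences.
fixed = ["tom loves candy",
         "this is a test",
         "hello world",
         "jerry loves programming"]

print(len(fixed))    # 4
```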

As for the result, your ids are sequential (0, 1, ...) because IndexFlatL2 does not provide an add_with_ids method (unless you wrap it with an IndexIDMap), as stated in the documentation.

The [[0 1]] are the positions of the matches in the corpus array, and the second array holds the respective scores. If you expected the index to contain 4 sentences, the missing commas are the reason it does not.
