I am trying to do semantic search with a sentence transformer and faiss.
I am able to generate embeddings from the corpus and perform a search with the query vector xq.
But what do the returned results refer to?
import faiss
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

def get_embeddings(code_snippets: str):
    return model.encode(code_snippets)

def build_vector_database(atlas_datapoints):
    dimension = 768  # dimensions of each vector
    corpus = ["tom loves candy",
              "this is a test"
              "hello world"
              "jerry loves programming"]

    code_snippet_emddings = get_embeddings(corpus)
    print(code_snippet_emddings.shape)

    d = code_snippet_emddings.shape[1]
    index = faiss.IndexFlatL2(d)
    print(index.is_trained)

    index.add(code_snippet_emddings)
    print(index.ntotal)

    k = 2
    xq = model.encode(["jerry loves candy"])

    D, I = index.search(xq, k)  # search
    print(I)
    print(D)
This code returns
[[0 1]]
[[1.3480902 1.6274161]]
But I can't tell which sentences xq matches; I only get back the indices and the matching scores, not the strings themselves.
How can I find the top-N matching strings from the corpus?
CodePudding user response:
To retrieve the query results, try something like this using the variables from your code (I has shape 1 x k, so iterate over its first row):
[corpus[i] for i in I[0]]
But if you have corpus as a np.array
object, you can do some cool slicing like this:
import numpy as np
# If your corpus is in array form.
corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])
# And indices can be a list of integers.
indices = [1,0]
# Results.
corpus[indices]
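Applied to the variables from your snippet, a minimal sketch (assuming corpus and I as defined in your code, with I of shape 1 x k) would be:

import numpy as np

corpus = np.array(corpus)   # turn the Python list into a numpy array
top_matches = corpus[I[0]]  # I[0] holds the top-k indices for the single query
print(top_matches)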
And it can get a little cooler if your indices are already a np.array, like the output of faiss; with 2 queries you get a 2 x k result:
import numpy as np
corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])
indices = np.array([[1,0], [0,2]])
corpus[indices]
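For completeness, the multi-query case against the index built in your own code looks like this (the second query string here is just a made-up example; model, index, corpus and k come from your snippet):

import numpy as np

xq2 = model.encode(["jerry loves candy", "tom loves programming"])
D2, I2 = index.search(xq2, k)   # D2 and I2 both have shape (2, k)
print(np.array(corpus)[I2])     # one row of top-k strings per query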
Additional Notes
The faiss.IndexFlatL2 object returns two arrays through the search() function:
- distances – output pairwise distances, size n*k, i.e. D in your code snippet: the distance of each of the top-k results from your query string.
- labels – output labels of the NNs, size n*k, i.e. I in your code snippet: the indices of the top-k results, in the order the vectors were added to the index.
Since you have only 1 query, n=1, therefore your I and D matrices are of size 1 x k.
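Putting the two together, a minimal sketch (reusing corpus, D and I from your snippet) that prints each of the top-k matches alongside its L2 distance:

for idx, dist in zip(I[0], D[0]):
    print(corpus[idx], dist)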
CodePudding user response:
You forgot some commas in your corpus, so you effectively passed only two sentences: adjacent string literals without a comma between them are concatenated into a single string.
As per the result, your ids are sequential, since IndexFlatL2 does not provide an add_with_ids method (unless you wrap it with an IndexIDMap), as stated in the documentation.
The [[0 1]] represents the indices into the corpus array, and the second array holds the respective scores. If you expected 4 values, it is because of the missing commas.