How to obtain SentenceTransformer vocab from corpus or query?

I am trying the SentenceTransformer model from SBERT.net and I want to know how it handles entity names. Are they marked as unknown, broken down into subword tokens, etc.? I want to make sure they are used in the comparison.

However, to do that I would need to see the vocab it builds for the query, and perhaps even convert an embedding back to text.

Looking at the API, it's not obvious to me how to do that.

Here is a quick example from their docs:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby."
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "A man is eating pasta."
]

top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    ...

CodePudding user response:

Your SentenceTransformer model actually wraps and uses a tokenizer from Hugging Face's transformers library under the hood. You can access it as the .tokenizer attribute of your model. The typical behaviour of such a tokenizer is to break unknown words down into WordPiece tokens. We can check that this is indeed what it does, which is relatively straightforward:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby."
]

# the tokenizer is just here:
tokenizer = embedder.tokenizer  # BertTokenizerFast

# and the vocabulary itself is there, if needed:
vocab = tokenizer.vocab  # dict of length 30522
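
# quick sanity check against the tokenization outputs below: "pasta"
# survives as a single vocab entry, while a rare name fragment like
# "edv" does not (the vocab is uncased, so look words up in lowercase)
print("pasta" in vocab)  # True
print("edv" in vocab)    # False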

# get the split of sentences according to the vocab, for example:
inputs = tokenizer(corpus, padding='longest', truncation=True)
tokens = [e.tokens for e in inputs.encodings]
# tokens contains:
# [
#   ['[CLS]', 'a', 'man', 'is', 'eating', 'food', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]'],
#   ['[CLS]', 'a', 'man', 'is', 'eating', 'a', 'piece', 'of', 'bread', '.', '[SEP]'],
#   ['[CLS]', 'the', 'girl', 'is', 'carrying', 'a', 'baby', '.', '[SEP]', '[PAD]', '[PAD]']
# ]

# now let's try with some unknown tokens and see what it does
queries = [
    "Edv Beq is eating pasta."
]
q_inputs = tokenizer(queries, padding='longest', truncation=True)
q_tokens = [e.tokens for e in q_inputs.encodings]
# q_tokens contains:
# [
#   ['[CLS]', 'ed', '##v', 'be', '##q', 'is', 'eating', 'pasta', '.', '[SEP]']
# ]
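
Note that this mapping only goes one way at the token level: you can turn token ids back into strings with the tokenizer, but a finished sentence embedding (a single 384-dimensional vector for all-MiniLM-L6-v2) cannot be decoded back into text. A minimal sketch, reusing the q_inputs from above:

ids = q_inputs.input_ids[0]
# recover the word pieces from the ids
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'ed', '##v', 'be', '##q', 'is', 'eating', 'pasta', '.', '[SEP]']
# or reassemble them into a plain string
print(tokenizer.decode(ids, skip_special_tokens=True))
# 'edv beq is eating pasta.'

So unknown entity names are not collapsed into a single [UNK] token; they are split into known word pieces, each of which gets embedded, and they therefore do contribute to the final sentence embedding and to any similarity comparison.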