I'm basing this question on this similar question, but the multilingual universal sentence encoder saved model has a slightly different structure:
from tensorflow.python.saved_model import loader_impl

saved_model = loader_impl.parse_saved_model("/path_to/universal_sent_encoder")
graph = saved_model.meta_graphs[0].graph_def
# node defs of the first library function whose name mentions "ptb"
fns = [f for f in graph.library.function if "ptb" in str(f).lower()][0].node_def
print(len(fns))
>>> 1272
# the SentencepieceOp node carries the serialized model in its 'model' attribute
nodes = [n for n in fns if 'SentencepieceOp' in n.name]
model_string = nodes[0].attr.get('model').s
I see a byte string with what I assume is a compressed list/dict of tokens:
model_string[100:200]
>>> b"\x19\n\x10extra_token_id_3\x15\x00\x00\x00\x00\x18\x04\n\n\n\x03\xe2\x96\x81\x15_\xbaU\xc0\n\x08\n\x01,\x15~\xdac\xc0\n\x08\n\x01.\x15\x08\xf6d\xc0\n\x08\n\x01s\x15\xe8\xa8\x8b\xc0\n\x0b\n\x04\xe2\x96\x81a\x15\xaf \x9b\xc0\n\x08\n\x01'\x15j\xe9\x9b\xc0\n\r\n\x06\xe2\x96\x81th"
But I've tried multiple ways of decoding/deserializing this:
import codecs
import pickle

decoded_model_string = codecs.decode(model_string, 'ISO-8859-1')  # decodes just fine
pickle.loads(model_string)
>>>
UnpicklingError Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(model_string)
UnpicklingError: invalid load key, '\x0a'
pickle.loads(decoded_model_string.encode('utf-8'))
>>>
UnpicklingError                           Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(decoded_model_string.encode('utf-8'))
UnpicklingError: invalid load key, '\x0a'
I've also tried tensorflow.io.decode_raw, but that runs into UTF decoding errors as well.
CodePudding user response:
Took a bit, but I had to load the byte string directly with the sentencepiece library:
import sentencepiece as spm

# the byte string is a serialized SentencePiece model proto
sp_model = spm.SentencePieceProcessor()
sp_model.LoadFromSerializedProto(model_string)

# rebuild the piece -> id vocabulary
vocab = {sp_model.IdToPiece(i): i for i in range(sp_model.GetPieceSize())}
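For what it's worth, this also explains why the pickle/codecs attempts fail: the attribute holds a serialized protobuf (a SentencePiece model proto), not pickled or compressed data. Once it's loaded, a quick sanity check looks something like the sketch below; the example sentence and the "▁the" lookup are just illustrative, and it assumes sp_model and vocab from the snippet above:

# sanity checks on the recovered tokenizer (assumes sp_model / vocab from above)
print(sp_model.GetPieceSize())                     # vocabulary size
print(sp_model.EncodeAsPieces("This is a test."))  # subword pieces for an example sentence
print(sp_model.EncodeAsIds("This is a test."))     # the corresponding ids
print(vocab.get("▁the"))                           # id lookup; returns None if the piece isn't in this vocab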
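If you also want the scores that show up as floats in the raw dump, the same bytes can be parsed directly as a model proto. A minimal sketch, assuming the pip-installed sentencepiece wheel ships the generated sentencepiece_model_pb2 module (recent versions do):

from sentencepiece import sentencepiece_model_pb2

proto = sentencepiece_model_pb2.ModelProto()
proto.ParseFromString(model_string)

# each entry carries the piece string and its log-probability score
for piece in proto.pieces[:10]:
    print(piece.piece, piece.score)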