I'm basing this question on this similar question, but the multilingual universal sentence encoder saved model has a slightly different structure:
from tensorflow.python.saved_model import loader_impl

saved_model = loader_impl.parse_saved_model("/path_to/universal_sent_encoder")
graph = saved_model.meta_graphs[0].graph_def
# node defs of the first library function whose name mentions "ptb"
fns = [f for f in graph.library.function if "ptb" in str(f).lower()][0].node_def
print(len(fns))
>>> 1272
# the SentencepieceOp node carries the serialized model in its 'model' attribute
nodes = [n for n in fns if 'SentencepieceOp' in n.name]
model_string = nodes[0].attr.get('model').s
I see a byte string with what I assume is a compressed list/dict of tokens:
model_string[100:200]
>>> b"\x19\n\x10extra_token_id_3\x15\x00\x00\x00\x00\x18\x04\n\n\n\x03\xe2\x96\x81\x15_\xbaU\xc0\n\x08\n\x01,\x15~\xdac\xc0\n\x08\n\x01.\x15\x08\xf6d\xc0\n\x08\n\x01s\x15\xe8\xa8\x8b\xc0\n\x0b\n\x04\xe2\x96\x81a\x15\xaf \x9b\xc0\n\x08\n\x01'\x15j\xe9\x9b\xc0\n\r\n\x06\xe2\x96\x81th"
But I've tried multiple ways of decoding/deserializing this:
import codecs
import pickle

decoded_model_string = codecs.decode(model_string, 'ISO-8859-1')  # decodes just fine
pickle.loads(model_string)
>>>
UnpicklingError Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(model_string)
UnpicklingError: invalid load key, '\x0a'
pickle.loads(decoded_model_string.encode('utf-8'))
>>>
UnpicklingError                           Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(decoded_model_string.encode('utf-8'))
UnpicklingError: invalid load key, '\x0a'
I've also tried tensorflow.io.decode_raw, but that runs into UTF decoding errors as well.
CodePudding user response:
Took a bit, but I had to load the byte string directly with the sentencepiece library:
import sentencepiece as spm

# the byte string is a serialized SentencePiece model proto
sp_model = spm.SentencePieceProcessor()
sp_model.LoadFromSerializedProto(model_string)

# rebuild the piece -> id vocabulary
vocab = {sp_model.IdToPiece(i): i for i in range(sp_model.GetPieceSize())}
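For what it's worth, this also explains why the pickle/codecs attempts fail: the attribute holds a serialized protobuf (a SentencePiece model proto), not pickled or compressed data. Once it's loaded, a quick sanity check looks something like the sketch below; the example sentence and the "▁the" lookup are just illustrative, and it assumes sp_model and vocab from the snippet above:

# sanity checks on the recovered tokenizer (assumes sp_model / vocab from above)
print(sp_model.GetPieceSize())                     # vocabulary size
print(sp_model.EncodeAsPieces("This is a test."))  # subword pieces for an example sentence
print(sp_model.EncodeAsIds("This is a test."))     # the corresponding ids
print(vocab.get("▁the"))                           # id lookup; returns None if the piece isn't in this vocab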
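If you also want the scores that show up as floats in the raw dump, the same bytes can be parsed directly as a model proto. A minimal sketch, assuming the pip-installed sentencepiece wheel ships the generated sentencepiece_model_pb2 module (recent versions do):

from sentencepiece import sentencepiece_model_pb2

proto = sentencepiece_model_pb2.ModelProto()
proto.ParseFromString(model_string)

# each entry carries the piece string and its log-probability score
for piece in proto.pieces[:10]:
    print(piece.piece, piece.score)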