What is a protobuf message?

I'm learning how to use TFRecords, and in the official tutorial they mention that you can print a tf.train.Example message (which, if I understand it right, is a primitive of the protobuf protocol).

I understand that TFRecords are used to serialize the data, and that they use the protobuf protocol in this case. I also understand that using tf.train.Feature, tf.train.Features and tf.train.Example, one can convert the data into the right format.

My question is: what does it mean to print a message in this context? (The tutorial shows how to print a tf.train.Example message.)

CodePudding user response:

A message is classically thought of as a collection of bytes that are conveyed from one process/thread to another process/thread. Typically (but not necessarily), the collection of bytes means something to the sender and receiver, e.g. it's an object that has been serialised somehow (perhaps using Google Protocol Buffers). So, an object can become a message by serialising it and placing the bytes into an array that one might term a "message".
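In the TensorFlow case, that serialised object is a tf.train.Example protobuf. Printing it just asks the protobuf library for a human-readable text rendering of the message's fields; the bytes that actually travel come from SerializeToString(). A minimal sketch, assuming TensorFlow is installed:

```python
import tensorflow as tf

# Build a tf.train.Example: a protobuf message whose fields describe one record.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
            "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"cat"])),
        }
    )
)

# "Printing the message" shows the protobuf text format: field names and values.
print(example)

# The actual message that gets written to a TFRecord file is the serialised bytes.
payload = example.SerializeToString()
print(type(payload), len(payload))
```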

It's not necessarily the case that the processes handling the collection of bytes will deserialise them. For example, a process that is simply going to pass them onwards down another connection need not actually deserialise them, if it already knows where the bytes are supposed to be sent.

The means by which a message is conveyed is typically some sort of queue / pipe / socket / stream / etc. Where it gets interesting is that most data transports of this sort are stream connections: whatever bytes you push in one end come out the other, with no built-in notion of where one message ends and the next begins. So, then, how to use those for sending messages?

The answer is that there has to be some way of demarcating one message from the next. There are lots of ways of doing that, but these days it makes far more sense to use something like ZeroMQ, which takes care of all that for you (and more besides). ZeroMQ is a library / protocol that transfers a collection of bytes from one process/thread to another over stream connections, and ensures that the receiving program gets the collection as one nice, complete buffer. Those bytes could be objects serialised with Google Protocol Buffers, or serialised in some other way (there are lots). HTTP is also used as a way of moving objects around, e.g. a page of HTML.
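As a rough illustration of that framing, here is a minimal sketch using pyzmq (an assumption on my part; any message-oriented transport would do). The sender pushes one buffer of bytes, and the receiver gets exactly that buffer back as a single message rather than an arbitrary slice of a stream:

```python
import threading
import zmq  # pyzmq; assumed installed, not part of the original answer

def receiver(ctx):
    pull = ctx.socket(zmq.PULL)
    pull.connect("inproc://demo")
    msg = pull.recv()                 # one complete message buffer, no framing code needed
    print("received", len(msg), "bytes")
    pull.close()

ctx = zmq.Context.instance()
push = ctx.socket(zmq.PUSH)
push.bind("inproc://demo")            # bind before the receiver connects (inproc requirement)

t = threading.Thread(target=receiver, args=(ctx,))
t.start()

payload = b"\x0a\x03cat"              # stand-in for protobuf-serialised bytes
push.send(payload)                    # ZeroMQ demarcates this as one message on the wire
t.join()
push.close()
ctx.term()
```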

So the pattern is object -> serialisation -> message buffer -> some sort of byte transport that demarcates one message from another -> message buffer -> deserialisation -> object.
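Sticking with the tf.train.Example from the question, the whole pattern can be sketched in a few lines; the "transport" here is just a variable standing in for a queue or socket:

```python
import tensorflow as tf

# object -> serialisation -> message buffer
example = tf.train.Example(
    features=tf.train.Features(
        feature={"label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7]))}
    )
)
wire_bytes = example.SerializeToString()

# ... some byte transport would carry wire_bytes to the receiver ...

# message buffer -> deserialisation -> object
received = tf.train.Example.FromString(wire_bytes)
print(received.features.feature["label"].int64_list.value[0])  # 7
```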

An advantage of serialisations like Protocol Buffers is that the sender and receiver need not be written in the same language, or share anything at all except the .proto file. Other approaches to serialisation often involve marking up class definitions in the program source code, which then makes it difficult to deserialise the data in another language.
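For example (with hypothetical names), a file thing.proto shared between teams can be compiled independently for each language; on the Python side, the generated module is all either party needs:

```python
# Hypothetical thing.proto, shared by sender and receiver:
#
#   syntax = "proto3";
#   message Thing {
#     string name = 1;
#     int32 count = 2;
#   }
#
# Compiled for Python with:  protoc --python_out=. thing.proto
# (a Java receiver would compile the same file with --java_out instead)

import thing_pb2  # the module protoc generates from thing.proto

msg = thing_pb2.Thing(name="widget", count=3)
wire_bytes = msg.SerializeToString()        # language-neutral bytes

decoded = thing_pb2.Thing.FromString(wire_bytes)
print(decoded.name, decoded.count)
```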

Also, in languages like C/C++ one might get away with simply copying the bytes at the object's address from one place to another. This can be a total disaster if the destination is a different machine; endianness etc. can matter a lot. There are serialisation standards that get close to this, notably Cap'n Proto.
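To see why endianness matters, the same 32-bit integer is laid out as a different byte sequence depending on byte order; a quick check with the standard library:

```python
import struct

value = 1_000_000
little = struct.pack("<I", value)   # little-endian layout, as in most x86 machine memory
big = struct.pack(">I", value)      # big-endian / "network" byte order

print(little.hex())  # 40420f00
print(big.hex())     # 000f4240
```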

There are variations. Within a process, "passing a message" can simply mean passing ownership of an object around. Ownership can be by convention, i.e. once I've written the object pointer to a message queue, I don't mutate the object anymore. I think in Rust it's even expressed by the language syntax, in that once ownership of an object has been given up, the language won't let you mutate it (worked out at compile time, which is part of what makes Rust so good). The net result looks like a message transfer, but in fact all that has happened is that a pointer (typically 64 bits) has been copied from A to B, not the entire data in the object. That is a lot faster.
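In Python terms, the in-process version of this is just putting an object reference on a queue; nothing is copied except the reference, and by convention the sending thread stops touching the object afterwards (Python does not enforce that the way Rust's compiler does). A rough sketch:

```python
import queue
import threading

q = queue.Queue()

def consumer():
    obj = q.get()              # receives the same object, not a copy
    obj["status"] = "processed"
    print("consumer saw", obj)

t = threading.Thread(target=consumer)
t.start()

payload = {"status": "new", "data": list(range(5))}
q.put(payload)                 # only a reference crosses the queue
# By convention the sender no longer mutates payload here; ownership has "moved".
t.join()
```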
