My application uses protocol buffers and creates a large number (100 million) of simple messages. Callgrind analysis shows that a separate heap allocation and deallocation is made for each instance.
Consider the following representative example:
// .proto
syntax = "proto2";
package testpb;

message Top {
  message Nested {
    optional int32 val1 = 1;
    optional int32 val2 = 2;
    optional int32 val3 = 3;
  }
  repeated Nested data = 1;
}
// .cpp
void test()
{
  testpb::Top top;
  for (int i = 0; i < 100'000; ++i) {
    auto* data = top.add_data();
    data->set_val1(i);
    data->set_val2(i * 2);
    data->set_val3(i * 3);
  }
  std::ofstream ofs{"file.out", std::ios::out | std::ios::trunc | std::ios::binary};
  top.SerializeToOstream(&ofs);
}
What is the most effective way to change the implementation so that the number of memory allocations is not linear in the number of Nested instances?
CodePudding user response:
I would suggest using arena allocation, which was designed for exactly this purpose: https://developers.google.com/protocol-buffers/docs/reference/arenas
Memory allocation and deallocation constitutes a significant fraction of CPU time spent in protocol buffers code. By default, protocol buffers performs heap allocations for each message object, each of its subobjects, and several field types, such as strings. These allocations occur in bulk when parsing a message and when building new messages in memory, and associated deallocations happen when messages and their subobject trees are freed.
Arena-based allocation has been designed to reduce this performance cost. With arena allocation, new objects are allocated out of a large piece of preallocated memory called the arena. Objects can all be freed at once by discarding the entire arena, ideally without running destructors of any contained object (though an arena can still maintain a "destructor list" when required). This makes object allocation faster by reducing it to a simple pointer increment, and makes deallocation almost free. Arena allocation also provides greater cache efficiency: when messages are parsed, they are more likely to be allocated in contiguous memory, which makes traversing messages more likely to hit hot cache lines.
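To make the "simple pointer increment" point concrete, here is a toy standalone sketch of the idea (this is my own illustrative class, not protobuf's Arena API): a single preallocated buffer, an offset that gets bumped on each allocation, and everything freed at once when the arena goes away.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy fixed-capacity arena. Allocation is just an aligned bump of one
// offset into a preallocated buffer; there is no per-object free.
// Deallocation happens wholesale when the arena is destroyed.
class ToyArena {
public:
    explicit ToyArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    // Returns nullptr when the arena is exhausted. `align` must be a
    // power of two.
    void* Allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + n > buffer_.size()) return nullptr;
        offset_ = aligned + n;
        return buffer_.data() + aligned;
    }

    std::size_t bytes_used() const { return offset_; }

private:
    std::vector<std::byte> buffer_;  // one upfront heap allocation
    std::size_t offset_;             // bump pointer
};

// Stand-in for the generated Nested message from the question.
struct Nested { int32_t val1, val2, val3; };
```

A real protobuf arena additionally grows in blocks and tracks destructors for non-trivially-destructible objects, but the hot path is this same bump.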
To get these benefits you'll need to be aware of object lifetimes and find a suitable granularity at which to use arenas (for servers, this is often per-request). You can find out more about how to get the most from arena allocation in Usage patterns and best practices.
This would change your allocations to look more like:

google::protobuf::Arena arena;
testpb::Top* top = google::protobuf::Arena::CreateMessage<testpb::Top>(&arena);

Each subsequent top->add_data() call then allocates the Nested message from the arena rather than the heap. Note that top is owned by the arena: do not delete it, and do not let it outlive the arena.