Why do git objects include a length and a delimiter as metadata?

Time:12-08

I'm doing a git course, learning about git objects.

It's not entirely clear why every git object stores metadata the way it does.

Every object prepends a header:

header = "blob #{content.bytesize}\0"  # size in bytes, not characters
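As a rough sketch of how that header gets used: assuming the classic SHA-1 object format, the header is prepended to the content and the combined bytes are hashed to produce the object ID. The helper names `blob_store` and `blob_id` are made up for illustration and are not part of git.

```ruby
require 'digest'

# Build the bytes that get hashed for a blob: "blob <byte size>\0<content>".
# blob_store is a hypothetical helper name, not a git API.
def blob_store(content)
  bytes = content.b                       # binary copy, so sizes are byte counts
  "blob #{bytes.bytesize}\0".b + bytes
end

# Object ID = SHA-1 hex digest of header plus content
# (assumes the SHA-1 object format, not SHA-256 repositories).
def blob_id(content)
  Digest::SHA1.hexdigest(blob_store(content))
end

puts blob_id("hello world\n")
```

Under these assumptions, `blob_id("")` should come out as e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known ID of the empty blob.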

Apparently, there are 4 types of git objects:

  • Blob
  • Tree
  • Commit
  • Annotated Tag

That's 4 possible options, or 2 bits of data. Even allowing for future expansion, this could have been made into a single byte, prepended to each object.

Knowing that the first byte will always be the type, and having the filesystem tell you the length of the file, you can easily calculate the length of the data in the object as file_size - 1 byte. This removes the need for a length field or a delimiter, since your metadata is now a constant length.

Even with the current design, where the type field is a variable-length string, you know the object's file size, which the filesystem (e.g. ext4) will tell you, and the header length, which you can find by reading up to and including the delimiter. So it seems you can easily work out the length of the data stored inside the object (file_size - header_length).
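The subtraction described above can be sketched as follows. One caveat: loose objects are zlib-compressed on disk, so the size the filesystem reports is the compressed size, and the arithmetic only works on the decompressed bytes. The function name `parse_object` and the hash keys are illustrative assumptions.

```ruby
# Split raw, already decompressed object bytes into header fields and content.
# parse_object is a hypothetical helper, not a git API.
def parse_object(raw)
  raw = raw.b
  nul = raw.index("\0") or raise ArgumentError, 'no NUL delimiter found'
  type, declared = raw.byteslice(0, nul).split(' ', 2)
  header_length = nul + 1                 # the header includes the delimiter
  computed = raw.bytesize - header_length # total size minus header length
  {
    type: type,
    declared_length: Integer(declared, 10),
    computed_length: computed,
    content: raw.byteslice(header_length, computed)
  }
end

info = parse_object("blob 5\0hello")
puts "declared=#{info[:declared_length]} computed=#{info[:computed_length]}"
```

For a well-formed object the two lengths agree, which is exactly why the declared length looks redundant at first glance.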

Is there a reason Torvalds (or whoever) chose to use a string to represent the object type, to include the content length (which can be calculated), and to include a delimiter?

CodePudding user response:

I can only guess*, too, but the most probable reason would be: robustness.

"Knowing that the first byte will always be the type, and having the filesystem tell you the length of the file"

Which assumes the data comes from a file and that the filesystem is correct!

What if the object file was not completely written to disk (missing bytes) or, for whatever reason, has additional data (extra bytes)? With the expected length explicitly specified, these error conditions are trivially detectable and sometimes even correctable.
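A minimal sketch of that kind of check, working on decompressed object bytes: compare the length declared in the header with the number of bytes that actually follow it. The name `check_object` and the returned symbols are assumptions made for this example, not how git itself reports errors.

```ruby
# Compare the declared length with the bytes actually present after the header.
# check_object and its return symbols are hypothetical, for illustration only.
def check_object(raw)
  raw = raw.b
  nul = raw.index("\0")
  return :malformed if nul.nil?
  # The header must look like "<type> <decimal size>".
  declared = raw.byteslice(0, nul)[/\A[a-z]+ (\d+)\z/, 1]
  return :malformed if declared.nil?
  actual = raw.bytesize - (nul + 1)
  case actual <=> declared.to_i
  when -1 then :truncated        # fewer bytes than promised
  when 1 then :trailing_data     # more bytes than promised
  else :ok
  end
end

puts check_object("blob 5\0hel")
```

Without the declared length, a truncated object would simply look like a shorter, perfectly valid one.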

Explicit lengths also allow for easy concatenation of multiple objects and for reading or transferring them in a stream (e.g. into memory or over the network, although git uses the pack format for that).
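To illustrate the streaming point, here is a sketch that reads several objects written back to back, using the declared length to find where each one ends. This is a made-up framing for illustration and is not the pack format; `each_object` is a hypothetical name.

```ruby
require 'stringio'

# Read back-to-back "<type> <size>\0<content>" records from an IO-like stream.
# each_object is a hypothetical helper; this is not git's pack format.
def each_object(io)
  while (header = io.gets("\0"))          # reads up to and including the NUL
    type, size = header.chomp("\0").split(' ', 2)
    size = Integer(size, 10)
    content = io.read(size) || ''         # read exactly the declared byte count
    if content.bytesize < size
      raise EOFError, "expected #{size} bytes, got #{content.bytesize}"
    end
    yield type, content
  end
end

# The second object's content contains a NUL byte itself.
stream = StringIO.new("blob 5\0hello".b + "blob 3\0a\0b".b)
each_object(stream) { |type, content| puts "#{type}: #{content.bytesize} bytes" }
```

Because the reader consumes exactly the declared number of bytes, content that happens to contain the delimiter byte does not confuse it, and the next header starts right where the previous content ends.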

Additionally, from the coder's perspective, knowing the size of an object beforehand allows allocating an appropriately sized buffer before reading the object from disk.

* This maybe should have been a comment, since it is not a real answer, but I felt it to be too long for a comment
