Git SHA-1 hash building: commit vs. tag?

I've figured out how to piece together a Git commit hash from the commit's data.

For example, if I run git cat-file -p <commit-hash>, I see:

tree b2b933453f1e264a48918b28e178f5af9f3dfd64
parent ed71bf918132a401fa234ff751526c06f3a14e9a
author alice <[email protected]> 1661867163 -0700
committer alice <[email protected]> 1661868191 -0700
gpgsig -----BEGIN PGP SIGNATURE-----

iQJGBAABCAAwFiEE1OufFrpPFIQJupdBQUppn0KVeM4FAmMOGK4SHGtlbnZvZ2dA
Z21haWwuY29tAAoJEEFKaZ9ClXjOh9AQAKXKdb9j7brxVvyvM9YTUgPSWIYxeOoW
kveCWLzrB312U7zEDTnJl/GPlL8bHpCHk/4+1PihDXshWspAz1pXxGIJwiP/yJwr
l+5BuZ4ZKpbK7jgVNr/LoTlCumnOKQ68BMb7ZWBTHtHap1IOC9FEm/oc4fu+FyZc
jqBhaf90fNKwu5FEchvOZKOiBpBhb/YTaxWjsNKroioYKRu5uv7vlMOoOEktTVdZ
xGsvxvGJEfGOvuHJI4ag8YHOfR/F3/TrfYFNiLylynBOR2NlcmNjcfeHTdNpIDcS
HQnZOA80rRA+YCf7hSOZ1cEjaZ250lQXpiLOXbaya8LeXOK2LXxaVbBmnKrP00GT
3aypCn55Y7oG6WZLlVYldYXgh6Es1nny63ICyvChC2EcT3qswet/BY3TP3GafM+D
rcafqAmkHpe1pLNRJByrQ6aitojc0fbvGyXqMDbWhYH9aIlXj8hwr2UB9OGlf5UY
BAzDBMOE1GOvyq3KJK8HmLEmVfmf4oqm+dUJ3FiyCdrtNVAhuek2BTigy+o4c+da
WK3MslJyHqaK76zUaL7wm/6tr54p1ORBA4D7XyppgJURsmRESsJ2y1lAYnsgD/4g
kb+IsJ9ClZIIHfDOF1vkGT9yAyGvwgxqox546vFuCqlJq1VfniX0Otc3sgBKbbgs
vBWo9Lyp6bUH
=2fOB
-----END PGP SIGNATURE-----

hello word

I can construct an algorithm that calls SHA-1 on the "pieces" of information (tree, author, committer, etc.) joined together. There are some caveats, like giving each piece its prefix and trailing newline:

treeStr = "tree " + inputTree + "\n"

and then the pieces can be concatenated and hashed together:

data = encode(treeStr, parentsStr, authorStr, committerStr, sigStr, msgStr)
commitHash = sha1(encode("commit ", data.length, 0x0, data))

and surprisingly enough, commitHash matches the hash Git assigns to the commit. Based on my experimentation, there is some strangeness with newlines, but the format for how these should be handled is something like:

tree ***\n
parent ***\n
author *** *** ***\n
committer *** *** ***\n
gpgsig ***\n
\n***\n

Where *** represents the actual normal input.

Now I am trying to do the same for annotated tags, but I am struggling to get the same SHA-1 output that Git gives.

If I run the same git cat-file -p <tag-hash>, I see:

object e912eb04010e4f2341968b3e65ab821069e9175a
type commit
tag v0.1.0
tagger alice <[email protected]> 1661867163 -0700

my tag msg
-----BEGIN PGP SIGNATURE-----

iQJGBAABCAAwFiEE1OufFrpPFIQJupdBQUppn0KVeM4FAmMOHI8SHGtlbnZvZ2dA
Z21haWwuY29tAAoJEEFKaZ9ClXjOJaMQAKToAHC/swEXvhiViLVhUoD3o+oI889I
w/Qs5Df7HsFHdSBMRlXImMwy27QeLbUIf72CFo7qTvK7/NM5tH+vh3r2Goi21+d0
+XUcTkV8Bx54NrbL6yz/NEwmv6RndOlJIip+iHyp8r3N19ZFt3sQqEslupIRi9cK
ao1Je9h+YirM4dmfgt4Jx+wZep6IEOpm3FOAMNrWYPwoM2B9v/PeVoP+59UyhL6d
1YguK0UTt6onOIR2RnpC4+9ETirr9ncIN1jrIXYT//oBIJ1e7OpTaI6jKnxh8zzA
pK0aWXj7Ck8MU+jd8EvV53t6LgClT1162HF7+GX5qQIPwr16cVU+pu6yGwHRp8Gw
/IjLVYLshpWLFR2iW6NIMCtdD+uDZWC5wY0F1zbobKTXJXNVPhx/7v8HZwDLZVoC
ai8zLzW+B1zQgTzzrKbVSSq60mRiqovvkGuV7uFIzYXOpbAQ0M1DRecaWFi+HdYh
pvnCvDpguEiLcrqj2ZnE/QIbhEYxFafBTgVmuOYKdidEY7te8xaAMg0cqnRNW8Qb
JBdgzV6SSM+zVYG52LjdkGRnP6LWwqLhq0tVxhA7OqAzLaAyA7pmWnZ5/goG/d1d
drsXO8IJVAe0cSHTCwUsrzbrNe8Lc7zl3iy/NeWM6Z0kKUm2A8KKBFZDWYe3WM+b
WBdTJh4esBHc
=3pj0
-----END PGP SIGNATURE-----

Some obvious differences stand out:

the message now comes before the PGP sig
the PGP sig is not prefixed with "gpgsig "

No matter how many combinations of weird newlines I try, I cannot get my algorithm to correctly build a SHA-1 hash based on this output.

The most logical format for how I see this is:

object ***\n
type ***\n
tag ***\n
tagger *** *** ***\n
\n***\n
***\n

But again, so far that has been unsuccessful. Even accounting for the switched message and sig:

data = encode(objectStr, typeStr, tagStr, taggerStr, msgStr, sigStr)
tagHash = sha1(encode("tag ", data.length, 0x0, data))

tagHash does not match what Git gives.

Questions:

Is there anything else that's different about annotated tag generation? Are my format assumptions wrong?
Is the output of git cat-file equivalent to how I should be piecing this together? I am assuming so because the commit worked that way, but it would be nice to confirm.
Git is open-source, right? Does anyone have a link for how they build the SHA-1 hash? Would be awesome to do this in a way that isn't guess and check.

Thanks in advance for the help, really struggling with this!

CodePudding user response：

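Yes, for commits and annotated tags, git cat-file -p <object> outputs the exact content of the object stored in Git (\n, \r\n, and all). Trees are the exception: -p pretty-prints them, so use git cat-file <type> <object> when you need the raw bytes of any object type.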
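If you want to check this against the object database itself, you can decompress a loose object directly. This is only a minimal sketch, assuming a SHA-1 repository and an object that is stored loose (not in a packfile); read_loose_object and split_header are hypothetical helper names, not part of Git:

```python
import hashlib
import zlib
from pathlib import Path


def read_loose_object(git_dir, oid):
    """Return the decompressed bytes (header + body) of a loose object.

    Assumes the object lives at <git_dir>/objects/<first 2 hex>/<rest>.
    Packed objects are not handled by this sketch.
    """
    path = Path(git_dir) / "objects" / oid[:2] / oid[2:]
    return zlib.decompress(path.read_bytes())


def split_header(raw):
    """Split b'<type> <size>\\0<body>' into (type, size, body)."""
    header, _, body = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), body


# Example usage (the id is whatever `git rev-parse mytag` prints):
#   raw = read_loose_object(".git", oid)
#   assert hashlib.sha1(raw).hexdigest() == oid
#   obj_type, size, body = split_header(raw)   # body is what cat-file shows
```

The body returned by split_header is what your own encoder has to reproduce byte for byte.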

Here is one way to generate the complete object's content (including the leading tag <length>\0) from the terminal:

printf "tag %d\0" $(git cat-file -p mytag | wc -c); git cat-file -p mytag

# to check if it has the correct hash:
(printf "tag ...) | sha1sum

You should compare the output of this command with the content generated by your code:

encode("tag ", data.length, "\0", data)

for example :

(printf "tag ...) > expected.txt
node myscript.js  > got.txt

diff expected.txt got.txt || echo "** content is different"

I was asking about your OS and terminal because, on Windows, some tools or libraries may insert \r\n instead of \n when you ask for a newline, and PowerShell is known to meddle with your output without you knowing. The parameters that make it emit plain UTF-8 are tricky to set right, and until a few versions ago it would always "helpfully" insert a BOM at the beginning of your output.

This wouldn't apply to strings manipulated within your program, but it may be leading you to think that your code produces weird things when it is actually the shell.
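When diff only says that the files differ, a byte-level comparison makes the culprit easier to spot. Here is a rough sketch in Python; first_difference and describe are hypothetical helpers, and the two inputs are assumed to be the contents of expected.txt and got.txt read in binary mode:

```python
def first_difference(expected: bytes, got: bytes):
    """Return the offset of the first differing byte, or None if equal."""
    for i, (a, b) in enumerate(zip(expected, got)):
        if a != b:
            return i
    if len(expected) != len(got):
        # One input is a strict prefix of the other.
        return min(len(expected), len(got))
    return None


def describe(expected: bytes, got: bytes, context: int = 16) -> str:
    """Summarise how `got` deviates from `expected`."""
    pos = first_difference(expected, got)
    if pos is None:
        return "identical"
    notes = []
    if got.startswith(b"\xef\xbb\xbf"):
        notes.append("got starts with a UTF-8 BOM")
    if b"\r\n" in got and b"\r\n" not in expected:
        notes.append("got contains CRLF line endings")
    lo = max(0, pos - context)
    notes.append(
        f"first difference at byte {pos}: "
        f"expected {expected[lo:pos + context]!r}, "
        f"got {got[lo:pos + context]!r}"
    )
    return "; ".join(notes)


# Example usage:
#   with open("expected.txt", "rb") as f: expected = f.read()
#   with open("got.txt", "rb") as f: got = f.read()
#   print(describe(expected, got))
```

Opening both files in binary mode matters here, since text mode may translate line endings and hide exactly the problem you are looking for.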

CodePudding user response：

First: yes, Git is open-source; see the public mirror here for instance.

The hash of an annotated tag is that of the tag's data including the signature:

$ git rev-parse v2.35.0
38fc0d036c2e0267736249eae49fb9df786fe87b
$ git cat-file -p v2.35.0
object 89bece5c8c96f0b962cfc89e63f82d603fd60bed
type commit
tag v2.35.0
tagger Junio C Hamano <[email protected]> 1643045149 -0800

Git 2.35
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE4fA2sf7nIh/HeOzvsLXohpav5ssFAmHu4R0ACgkQsLXohpav
5svJJRAApDztS5mUxex7HjmgJ4jltnrzcJNQV0ks2XwK2iM9/aj5ISwWOuWT3+WK
cwP6u7vVniO33sWkxN0j7g+D+HBOvCf2ZW2MKWNp3bKEUSKx4IxOmDfdyop2YER2
fUm2D3S0sR6zbXWOz3WQAlLwlpsU1DwXLW/bnsZHPux7lhvKvOf/dpymVVbsiDz8
tNdEJVeuBcqLX/Sc8Lpk8yjGHxgaRS1Eq24apwim8i1GG32ies30eWVXhkMf7VlY
JAwdMl5phtoQbEdv7wKR7kbAetjHE/HU5Oz4xkqiHf+dfqSyFdPtDG2GZfdroqFT
XHnX6goJIFphg3sfuuu1WRJWGMiPRPEOYZ4wxm+nQujFB7oo7RgB82cjIW65J/ur
CgNznRc7YfMfkEWmpZdQkK/jCiPkRy+AMKzCs0Dmt+g4NJy+4f9EZUpgTddslT7X
renX79+Crs/UMhQBT7bl04FKwbCbrtBMtnO0ZPAc48svUP+x7Sc9cnn4X2DCBALR
TeERWOLM+AIlOWgLFvkbbdM4NUQfTD4+XNEJvWwkKk5YujVjBjUw10TQR1IuADVr
zp9i6pUkHbB6XflUemEg9pZOj8TbxwPDvTasANxmeSpnwLW2c0n2o066HFk37nPX
jiYm+AatnQg0/V6uNSilYH2UV176+uDEc0WkznImbiyEn11jHBs=
=VCUV
-----END PGP SIGNATURE-----
$ git cat-file -p v2.35.0 | git hash-object -t tag --stdin
38fc0d036c2e0267736249eae49fb9df786fe87b

Note how the produced hash matches the tag. We can reproduce this in Python:

>>> import hashlib
>>> import subprocess
>>> p = subprocess.Popen("git cat-file -p v2.35.0", shell=True, stdout=subprocess.PIPE)
>>> s = p.stdout.read()
>>> p.wait()
0
>>> prefix = f"tag {len(s)}\0".encode('utf8')
>>> prefix
b'tag 974\x00'
>>> h = hashlib.sha1()
>>> h.update(prefix)
>>> h.update(s)
>>> h.hexdigest()
'38fc0d036c2e0267736249eae49fb9df786fe87b'

which shows there's nothing special going on here.

This is the same as for commits, signed or otherwise. The raw data is the raw data; it gets prefixed with the object type (one of "blob", "commit", "tag", or "tree"), an ASCII space, the length of the object in bytes written as ASCII decimal digits, and a NUL byte. We then compute a SHA-1 (or, with the new SHA-256 mode, a SHA-256) hash of those bytes, and that's the Git-level object hash ID.
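That rule fits in a few lines of Python. The following is a sketch only; git_object_id is a hypothetical name, and the algo parameter simply selects the hashlib algorithm to mirror the SHA-1 and SHA-256 modes:

```python
import hashlib

OBJECT_TYPES = ("blob", "commit", "tag", "tree")


def git_object_id(obj_type: str, body: bytes, algo: str = "sha1") -> str:
    """Hash '<type> <decimal length>\\0' followed by the raw object bytes."""
    if obj_type not in OBJECT_TYPES:
        raise ValueError(f"unknown object type: {obj_type!r}")
    # The length is the byte count of the body, not a character count.
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + body).hexdigest()


# The empty blob has a well-known id in SHA-1 repositories:
print(git_object_id("blob", b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Passing the bytes from git cat-file -p <tag> with obj_type="tag" should give the same id as git rev-parse for that tag, provided the bytes were captured without any newline or encoding translation.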

Note that the signature in the tag or in the commit is a GPG or SSH or whatever digital signature of some of the data, not all of it: we can't sign our own signature until we have the signature, so there's a chicken-and-egg issue here. So we sign the commit or tag minus the signature itself, then embed the signature bytes, then compute the hash of the whole thing.
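To make the "minus the signature" part concrete, here is a hedged sketch of how the signed payload could be recovered from the raw object bytes. It assumes PGP-armored signatures as in the examples above, a tag whose signature is appended after the message, and a commit whose gpgsig header uses continuation lines that start with a space in the raw object; split_signed_tag and strip_commit_signature are hypothetical names:

```python
PGP_BEGIN = b"-----BEGIN PGP SIGNATURE-----"


def split_signed_tag(body: bytes):
    """Split raw tag bytes into (signed payload, signature).

    The signature is taken to start at the first line that begins
    with the PGP armor header; unsigned tags yield an empty signature.
    """
    if body.startswith(PGP_BEGIN):
        idx = 0
    else:
        idx = body.find(b"\n" + PGP_BEGIN)
        if idx == -1:
            return body, b""
        idx += 1  # keep the newline with the payload
    return body[:idx], body[idx:]


def strip_commit_signature(body: bytes) -> bytes:
    """Return raw commit bytes with the gpgsig header removed."""
    headers, sep, message = body.partition(b"\n\n")
    kept = []
    skipping = False
    for line in headers.split(b"\n"):
        if line.startswith(b"gpgsig "):
            skipping = True  # drop the header line itself
            continue
        if skipping and line.startswith(b" "):
            continue  # drop continuation lines of the signature
        skipping = False
        kept.append(line)
    return b"\n".join(kept) + sep + message
```

In both cases the object ID is still computed over the full bytes, signature included; the stripped form is only what the signature itself covers.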