I came across a repo with a file like this: til/LINQサンプル.cs/LINQサンプル.cs/Program.cs, which was encoded in git-log as "LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs/LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs/Program.cs"
What encoding does Git use for file names with non-ASCII characters?
What I already tried
- using different --encoding param values
- reading git-log docs
- converting unicode text into bytes:
- Text (4 glyphs): サンプル
- Encoded in log (12 numbers): \343\202\265\343\203\263\343\203\227\343\203\253
- Text as bytes (12 bytes): 227 130 181 227 131 179 227 131 151 227 131 171
CodePudding user response:
What encoding does Git use for file names with non-ASCII characters?
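tl;dr. Git stores whatever bytes the filesystem gives it; it does not pick an encoding of its own. In your case, \343\202\265 is three bytes written in octal (base 8). Converting them to hex gives e3 82 b5, which is the UTF-8 encoding for サ. The escapes appear in the git-log output because core.quotePath is enabled by default, so commands that print paths quote every byte above 0x80 as an octal escape. With git config core.quotePath false the bytes are printed verbatim, and a UTF-8 terminal shows them as サンプル.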
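The octal-to-UTF-8 conversion described above can be checked with a few lines of Python. This is only a sketch: the helper name unquote_git_path is hypothetical, it handles only three-digit octal escapes (not the other C-style escapes Git may emit, such as \t or \"), and it assumes the underlying bytes are UTF-8.

```python
import re


def unquote_git_path(quoted: str) -> str:
    """Turn a quoted path as printed by Git back into text.

    Sketch only: handles \\NNN octal escapes and assumes the bytes are UTF-8.
    """
    # Drop the surrounding double quotes, if any.
    if len(quoted) >= 2 and quoted[0] == quoted[-1] == '"':
        quoted = quoted[1:-1]
    raw = quoted.encode("utf-8")
    # Replace each \NNN (octal, at most \377) with the single byte it stands for.
    raw = re.sub(rb"\\([0-3][0-7]{2})", lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


quoted = r'"LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs/Program.cs"'
print(unquote_git_path(quoted))               # LINQサンプル.cs/Program.cs
print(bytes([0o343, 0o202, 0o265]).hex())     # e382b5
print(list("サ".encode("utf-8")))              # [227, 130, 181]
```

The last line matches the first three decimal byte values listed in the question (227 130 181), so the "12 numbers" in the log are the same 12 bytes, just written in base 8.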
Git stores filenames in tree objects, which are akin to directories. You can refer to the top-level tree object of any commit by appending ^{tree} to it. For instance, git cat-file -p HEAD^{tree} shows the top-level tree object for your current checkout.
For example, if we have the file til/LINQサンプル.cs we would see...
$ git cat-file -p HEAD^{tree}
040000 tree 4ef35381184b94ea9e9114a9ab37a9ed2061f598 til
This says til is a tree object (directory) with the ID 4ef35381184b94ea9e9114a9ab37a9ed2061f598. If we examine that tree object...
$ git cat-file -p 4ef35381184b94ea9e9114a9ab37a9ed2061f598
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 LINQサンプル.cs
That says til/ contains the file LINQサンプル.cs with permissions 0644 stored in the blob object e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.
If we look at the raw content of that tree object, with the non-ASCII bytes written as octal escapes, we see...
100644 LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs
which is the UTF-8 encoding of LINQサンプル.cs.
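To make the "Git just stores bytes" point concrete, here is a minimal sketch of how one entry sits inside a tree object body: the mode in ASCII, a space, the file name as raw bytes, a NUL byte, then the binary object ID. The body below is built by hand from the values shown above rather than read from a repository, the function name parse_tree is hypothetical, and 20-byte SHA-1 object IDs are assumed.

```python
def parse_tree(body: bytes):
    """Yield (mode, name_bytes, object_id_hex) for each entry of a raw tree body.

    Assumes SHA-1 repositories, where every object ID is 20 binary bytes.
    """
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the ASCII mode
        nul = body.index(b"\0", space)     # end of the file name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul]         # raw bytes, no encoding recorded
        oid = body[nul + 1:nul + 21].hex()
        yield mode, name, oid
        pos = nul + 21


# Hand-built entry using the name and blob ID from the example above.
name = "LINQサンプル.cs".encode("utf-8")
oid = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
body = b"100644 " + name + b"\0" + oid

for mode, raw_name, hex_oid in parse_tree(body):
    print(mode, raw_name, hex_oid)
    print(raw_name.decode("utf-8") == "LINQサンプル.cs")
```

Nothing in the entry says which encoding the name uses. It reads back as LINQサンプル.cs only because the bytes that were written happened to be UTF-8.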