How to decode escaped file names in git-log?

Time:11-03

I came across a repo with a file like this: til/LINQサンプル.cs/LINQサンプル.cs/Program.cs, which was encoded in git-log as "LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs/LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs/Program.cs"

What encoding does Git use for file names with non-ASCII characters?

What I already tried

  • using different --encoding param values
  • reading git-log docs
  • converting unicode text into bytes:
    • Text (4 glyphs): サンプル
    • Encoded in log (12 numbers): \343\202\265\343\203\263\343\203\227\343\203\253
    • Text as bytes (12 bytes): 227 130 181 227 131 179 227 131 151 227 131 171

CodePudding user response:

What encoding does Git use for file names with non-ASCII characters?

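tl;dr. Git stores file names as raw bytes, exactly as they were given to it, and does not re-encode them. In your case, \343\202\265 is three bytes written in octal (base 8). Converting to hex gives e3 82 b5, which is the UTF-8 encoding of サ; in decimal those are 227 130 181, the same bytes you already computed. The escaping is only how Git displays paths: by default (core.quotePath is true), commands that print paths wrap the name in double quotes and show every byte above 0x80 as an octal escape. Running git config core.quotePath false makes git-log print the name as-is. The --encoding option only affects commit messages, not paths.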
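
If you only have the escaped text and want the original name back, the octal-to-UTF-8 conversion described above can be sketched in Python. This is a minimal sketch, not part of Git: unquote_git_path is a hypothetical helper name, it assumes the name is UTF-8, and it only handles \NNN octal escapes (other escapes Git may emit, such as \\ or \", are left untouched).

```python
import re


def unquote_git_path(quoted: str) -> str:
    """Decode a path printed by git with octal escapes (hypothetical helper)."""
    # Strip the surrounding double quotes git puts around quoted paths.
    if len(quoted) >= 2 and quoted[0] == '"' and quoted[-1] == '"':
        quoted = quoted[1:-1]
    # Work on bytes: each \NNN escape stands for exactly one byte.
    raw = re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),  # e.g. b"343" -> 227 (0xe3)
        quoted.encode("utf-8"),
    )
    # Assumption: the file name was stored as UTF-8.
    return raw.decode("utf-8")


escaped = r'"LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs/Program.cs"'
print(unquote_git_path(escaped))
```

Decoding on the byte level matters here: each glyph spans three escapes, so the escapes have to be collected into bytes first and only then decoded as UTF-8.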


Git stores file names in tree objects, which are akin to directories. You can see the top-level tree object for any commit by appending ^{tree} to it. For example, git cat-file -p HEAD^{tree} shows the top-level tree object for your current checkout.

For example, if we have the file til/LINQサンプル.cs we would see...

$ git cat-file -p HEAD^{tree}
040000 tree 4ef35381184b94ea9e9114a9ab37a9ed2061f598    til

This says til is a tree object (directory) with the ID 4ef35381184b94ea9e9114a9ab37a9ed2061f598. If we examine that tree object...

$ git cat-file -p 4ef35381184b94ea9e9114a9ab37a9ed2061f598
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    LINQサンプル.cs

That says til/ contains the file LINQサンプル.cs with permissions 0644 stored in the blob object e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.

If we look at the raw contents of that tree object, with the non-ASCII bytes shown as octal escapes, we see...

100644 LINQ\343\202\265\343\203\263\343\203\227\343\203\253.cs

Which is the UTF-8 encoding of LINQサンプル.cs.
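
To make the "names are just bytes" point concrete, here is a small Python sketch that parses a raw tree object body. It relies on the raw tree layout (mode, a space, the name bytes, a NUL byte, then the binary object ID) and assumes a SHA-1 repository, where the ID is 20 bytes. parse_tree_entries is a hypothetical helper, and the tree body is built by hand from the example entry above rather than read from a real repository.

```python
def parse_tree_entries(body: bytes):
    """Yield (mode, name_bytes, object_id_hex) from a raw tree object body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)        # the mode ends at the first space
        nul = body.index(b"\0", space + 1)   # the name ends at the NUL byte
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul]           # raw bytes, no encoding recorded
        object_id = body[nul + 1:nul + 21].hex()  # assumption: 20-byte SHA-1
        yield mode, name, object_id
        pos = nul + 21


# Hand-built body with the single entry from the example above.
name = "LINQサンプル.cs"
blob_id = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
body = b"100644 " + name.encode("utf-8") + b"\0" + bytes.fromhex(blob_id)

entries = list(parse_tree_entries(body))
mode, raw_name, object_id = entries[0]
print(mode, raw_name, object_id)
print(raw_name.decode("utf-8"))  # decoding as UTF-8 is the reader's choice
```

Nothing in the entry says which encoding the name uses; the name comes back as 19 bytes, and treating them as UTF-8 is a decision made by whoever displays them.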
