How to find Bidirectional Unicode control characters in git repo?

Time:03-28

I have a package on NPM that socket.dev reports as containing "Bidirectional unicode control characters".

I found an answer in this question: "How to update GitHub Actions CI to detect Trojan Code commits (malicious [bidirectional] unicode chars, python)".

I've used:

git grep -oP "[^\x00-\x7F]*"

It found some matches in binary files, so I removed all binary flags from .gitattributes. Now the only matches are files in the __tests__ directory (some ANSI files and one image), but those are not published to NPM.

What is the proper way to find these "Bidirectional unicode control characters"?

Answer:

The Perl one-liner

perl -CSD -ne 'print "$ARGV: $_" if /\p{Bidi_Control}/' file1 file2 ...

will print out lines of the given UTF-8 encoded files that have bidirectional control characters in them.
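For readers without Perl at hand, the same line-by-line check can be sketched in Python. This is only an illustration, not part of the original answer: the names BIDI_CONTROL, lines_with_bidi and report are made up here, and the character class is written out by hand on the assumption that it covers the same code points as \p{Bidi_Control}.

```python
import re

# Hand-written equivalent of \p{Bidi_Control} (assumption: same 12 code points).
BIDI_CONTROL = re.compile("[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]")


def lines_with_bidi(path):
    """Yield (line_number, line) for each line of a UTF-8 file that contains a bidi control."""
    # errors="replace" keeps the scan going if a file is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if BIDI_CONTROL.search(line):
                yield number, line.rstrip("\n")


def report(paths):
    """Print every matching line as path:line: text and return the number of matches."""
    count = 0
    for path in paths:
        for number, line in lines_with_bidi(path):
            # ascii() escapes the invisible characters so they are visible in a terminal.
            print(f"{path}:{number}: {ascii(line)}")
            count += 1
    return count
```

Calling report(["file1", "file2"]) prints the offending lines with the control characters shown as \u escapes, which makes them easier to spot than in the raw Perl output.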

Unfortunately, PCRE, which git grep -P uses, doesn't support anywhere near as many Unicode properties as Perl regular expressions do. You can search for the control characters explicitly, though:

$ git clone https://github.com/jcubic/jquery.terminal.git
$ cd jquery.terminal
$ git grep -IP '(*UTF)[\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]'
js/jquery.terminal-src.js:        "&lrm;": "<U+200E>",
js/jquery.terminal-src.js:        "&rlm;": "<U+200F>",

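The -I option skips binary files, and the leading (*UTF) puts PCRE into UTF mode so that the \x{...} escapes match code points rather than single bytes.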

(The list of control characters is taken from perluniprops.)
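Once a match is found, it helps to know which control character it is, since they are invisible in most editors. A minimal sketch, assuming the twelve code points from the pattern above; BIDI_CONTROLS and describe_bidi are hypothetical names used only for this illustration, and the character names come from Python's standard unicodedata module.

```python
import unicodedata

# The code points from the git grep pattern above:
# U+061C, U+200E, U+200F, U+202A..U+202E, U+2066..U+2069.
BIDI_CONTROLS = frozenset(
    [0x061C, 0x200E, 0x200F]
    + list(range(0x202A, 0x202F))
    + list(range(0x2066, 0x206A))
)


def describe_bidi(text):
    """Return (offset, 'U+XXXX', name) for every bidi control character in text."""
    found = []
    for offset, char in enumerate(text):
        if ord(char) in BIDI_CONTROLS:
            found.append((offset, f"U+{ord(char):04X}", unicodedata.name(char)))
    return found
```

For the two lines reported in jquery.terminal-src.js, this would identify the characters as LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK, which are the values of the "&lrm;" and "&rlm;" entity mappings.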
