I need help replacing a Russian phrase in Perl on Windows

Time:04-14

perl -pi -e "s/\x22message\x22\s \x22Боже, ты посмотри вокруг, что происходит!\x22/\x22message\x22 \x22\x22/g;" "D:\Sav\scripts\chat_wheel.txt"

There is nothing wrong with this command except for the Russian text portion that I want to remove:

Боже, ты посмотри вокруг, что происходит!

When I run it in cmd.exe, I get the following error message:

Nested quantifiers in regex; marked by <-- HERE in m/\x22message\x22\s \x22??? <-- HERE ?, ?? ???????? ??????, ??? ??????????!\x22/ at -e line 1.

So, how do I replace the Russian phrase while keeping the command as a one-liner? Is it even possible?

My console uses CP 65001 (UTF-8). [From Win32::GetConsoleCP()]
My Active Code Page (ACP) is 1252 [From Win32::GetACP()].
My file is encoded using UTF-8.

CodePudding user response:

Б is being replaced by ?. This is because it's not supported by the console's code page, by the active code page, or both.

Your console's code page is set to 65001, or UTF-8. Your console can therefore handle any character in the Unicode character set. The problem is obviously not here.

Windows system calls that deal with strings come in two varieties: a "W"ide variety that uses UTF-16le, and an "A"NSI variety that uses the Active Code Page. If Perl used the "W" interface to grab its command-line parameters, we wouldn't be having this problem. Perl instead uses the "A" interface for this (and all other) system calls.

This means Perl can only accept command-line arguments that can be represented in the Active Code Page. In your case, it's 1252, and the cp1252 character set doesn't include any Cyrillic characters.
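To see why the error message mentions nested quantifiers, here is a minimal sketch that imitates the lossy conversion with the Encode module. This is an illustration under the assumption that Encode's cp1252 substitution behaves like the "A" interface here (each unmappable character becomes ?); it is not what Windows literally runs.

```perl
use strict;
use warnings;
use utf8;                      # this demo file itself is saved as UTF-8
use Encode qw(encode);

# cp1252 has no Cyrillic letters, so each one becomes the substitution
# character "?" when the text is forced into that code page.
my $ansi = encode('cp1252', 'Боже');

# Perl therefore sees a pattern like \x22???? instead of \x22Боже.
# "\x22?" is an optional quote, the next "?" makes it non-greedy,
# and the "?" after that is a quantifier applied to a quantifier.
my $pattern = '\x22' . $ansi;
my $re      = eval { qr/$pattern/ };
my $error   = defined $re ? '' : $@;

print "converted: $ansi\n";
print "error: $error";
```

The printed error should start with the same "Nested quantifiers in regex" text as the one in the question.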

Assuming we don't want to replace every character with an escape (like how you replaced double-quotes with \x22), we're going to need to do something different.

Since we can't pass the script as an argument, we'll need to provide it in a file rather than with -e, or via a pipe:

echo s/"message"\s "\KБоже(?=")// | perl -i -p - file.txt
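For the file-based alternative, the substitution itself can be tried out on a sample line first. The sketch below is only an illustration: the sample line (with a tab plus a space after "message", to satisfy the \s followed by a space in the pattern) is an assumption about what chat_wheel.txt contains. There is deliberately no use utf8, so the Cyrillic text is matched byte for byte, as it would be in the piped version.

```perl
use strict;
use warnings;
# No "use utf8" here on purpose: the literals below are raw UTF-8 bytes.

# Hypothetical sample line standing in for a line of chat_wheel.txt.
my $line = qq{"message"\t "Боже, ты посмотри вокруг, что происходит!"\n};

# \K keeps everything matched so far, so only the phrase is removed;
# (?=") checks that the closing quote follows without consuming it.
(my $fixed = $line)
    =~ s/"message"\s "\KБоже, ты посмотри вокруг, что происходит!(?=")//;

print $fixed;
```

Saving just the s/...// statement in a UTF-8 file (say fix.pl, a hypothetical name) and running perl -i -p fix.pl file.txt would apply it to every line of the file, with no Cyrillic text on the command line.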

A better but more drastic solution would be to change Perl's ACP to 65001.


There's a second issue.

Perl expects its source code to be encoded using (8-bit clean) ASCII unless you provide use utf8;. So while you think you're passing s/...Боже...//, you're really passing s/...\xD0\x91\xD0\xBE\xD0\xB6\xD0\xB5...//.

This works out OK because you don't decode your input file either. But it can lead to surprises. For example, "Б" =~ /^[Бж]\z/ ("\xD0\x91" =~ /^[\xD0\x91\xD0\xB6]\z/) would return false!
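The surprise is easy to reproduce. The following sketch runs the same match twice in one UTF-8-encoded file, once in a scope without use utf8 and once with it (the pragma is lexically scoped); the variable names are made up for the demo.

```perl
use strict;
use warnings;

my ($byte_len, $byte_match);
{
    no utf8;    # the default: string and regex literals are bytes
    $byte_len   = length("Б");                  # two bytes: D0 91
    $byte_match = "Б" =~ /^[Бж]\z/ ? 1 : 0;     # class matches one byte only
}

my ($char_len, $char_match);
{
    use utf8;   # literals are decoded into characters
    $char_len   = length("Б");                  # one character
    $char_match = "Б" =~ /^[Бж]\z/ ? 1 : 0;     # class matches the character
}

print "bytes: length=$byte_len match=$byte_match\n";
print "chars: length=$char_len match=$char_match\n";
```

Without use utf8, the character class holds four separate bytes, so it can match only one byte of the two-byte string before \z fails.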

To fix this in a script, you'd use

use utf8;                              # Source code is using UTF-8.
use open ':std', ':encoding(UTF-8)';   # Terminal provides & expects UTF-8.
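As a fuller sketch of a standalone script along these lines: the version below sets the encoding layer explicitly on each handle instead of relying on use open, which is a variation chosen to keep the example self-contained rather than the exact method above. The temporary file and its contents are hypothetical stand-ins for chat_wheel.txt.

```perl
use strict;
use warnings;
use utf8;                               # source code is UTF-8
use File::Temp qw(tempfile);

# Create a small UTF-8 file to work on (stand-in for the real file).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} qq{"message"\t "Боже, ты посмотри вокруг, что происходит!"\n};
close $tmp or die "close: $!";

# Read the file as decoded characters.
open my $in, '<:encoding(UTF-8)', $path or die "open $path: $!";
my @lines = <$in>;
close $in;

# Pattern and data are both characters now, so they compare like for like.
s/"message"\s "\KБоже, ты посмотри вокруг, что происходит!(?=")// for @lines;

# Write the result back, encoding to UTF-8 again.
open my $out, '>:encoding(UTF-8)', $path or die "open $path: $!";
print {$out} @lines;
close $out or die "close: $!";
```

Because both the pattern and the file contents are decoded, character classes and length behave as expected on the Cyrillic text.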

For the one-liner, the -C switch will do here.

echo s/"message"\s "\KБоже(?=")// | perl -i -C -p - file.txt