How can I use unicode characters in perl regex substitution command?

Time:12-15

This doesn't work when using unicode characters (in Ubuntu bash):

$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a

This fails even though the syntax seems to be supported by PCRE (at least according to regex101).

What am I doing wrong? Am I missing some flag in the perl command?

This "just works" in JavaScript, so I would use node if I could come up with a simple one-liner for the command line, but I still want to know why the perl command is not working.


For context:

I'm trying to use substitutions like /[àâáãä]/a/g, /[òôóõö]/o/g, etc. to asciify a dictionary file (i.e. remove accents and similar marks from a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ IDEA).

Basically these are the steps to make an "asciified" extra dictionary:

  1. Download the .dic file for the language (list of all words)
  2. Use grep to filter words containing non-ascii / replaceable characters
  3. Use regex substitutions in succession to make words accent-insensitive
  4. Import the asciified .dic file in the IDE (in addition to the standard language dictionary)

CodePudding user response:

One practical approach for the whole task is to use Text::Unidecode (a CPAN module, not part of core Perl):

perl -C -MText::Unidecode -pe'unidecode($_)'  <<< 'à'

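Prints a. The module transliterates Unicode text into plain ASCII. Called in void context, as here, unidecode modifies its argument in place, so -p prints the converted line.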

Another approach: decompose ("normalize") the characters using Unicode::Normalize, so that each base character and its diacritical marks (combining accents) become separate code points while still forming a valid grapheme. Then remove the diacriticals (\p{NonspacingMark} or \p{Mn}) with a simple regex.

Both approaches have exceptions and edge cases, but I think either one may do what you need.


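As for code that contains specific (literal) non-ASCII characters: you need to tell Perl that the program source is UTF-8, via the utf8 pragma (use utf8;) or the command-line switch -Mutf8. Without it, Perl reads the à in the pattern as two separate bytes, while -C decodes the à in the input into a single character, so the character class never matches it. The /u modifier is not required for this.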

perl -C -Mutf8 -pe's/[à]/a/g' <<< 'à'

CodePudding user response:

I tried

bes:~ $ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a

and it worked for me using bash on Linux.
