A Perl regex substitution with a character class doesn't work for me when the class contains Unicode characters (in Bash on Ubuntu):
$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a
Even though it seems to be supported by PCRE (at least according to regex101).
What am I doing wrong? Am I missing some flag in the perl command?
This "just works" in JavaScript, so I would use node if I could come up with a simple one-liner for the command line, but I still want to know why the perl command is not working.
For context:
I'm trying to use substitutions like s/[àâáãä]/a/g, s/[òôóõö]/o/g, etc. to asciify a dictionary file (i.e. remove accents and similar marks from a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ IDEA).
Basically these are the steps to make an "asciified" extra dictionary:
- Download the .dic file for the language (list of all words)
- Use grep to filter words containing non-ascii / replaceable characters
- Use regex substitutions in succession to make words accent-insensitive
- Import the asciified .dic file in the IDE (in addition to the standard language dictionary)
CodePudding user response:
One practical approach for all of it is to use Text::Unidecode:
perl -C -MText::Unidecode -pe'unidecode($_)' <<< 'à'
This prints a. The module transliterates Unicode text into plain ASCII. Called in void context, as here, unidecode modifies its argument in place.
Another approach: decompose characters ("normalize") using Unicode::Normalize, so that each base character and its diacritical marks (combining accents) are separated into their own code points while still forming a valid grapheme, then remove the diacritics (\p{NonspacingMark} or \p{Mn}) with a simple regex.
Both of these approaches have exceptions and edge cases, but I think either one may do what you need.
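As for code containing specific (literal) characters: you need to tell Perl that the program source is UTF-8, either with the utf8 pragma (use utf8;) or with the command-line flag -Mutf8. Otherwise the literal à in the source is read as its two UTF-8 bytes, so the character class contains those two bytes and matches neither of them against the single decoded character that -C produces on input.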
perl -C -Mutf8 -pe's/[à]/a/g' <<< 'à'
CodePudding user response:
I tried
bes:~ $ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a
and it worked for me using bash on Linux.