When I run echo é | tr é e, I get ee, not the e I was expecting.
Here's the output of the locale command:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
CodePudding user response:
It looks like tr doesn't handle multibyte characters: it works on bytes, and in UTF-8 é is encoded as two bytes (octal 303 251).
$ echo 'é' | od -c
0000000 303 251  \n
0000003
$ echo 'é' | tr 'é' e | od -c
0000000 e e \n
0000003
Since the left-hand set is 2 bytes long, tr repeats the last character of the right-hand set until both sets are the same length.
$ echo 123456789 | tr 2468 xy
1x3y5y7y9
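To check how many bytes a character occupies, you can count them directly. This is only a minimal sketch: byte_count is a hypothetical helper name, and it assumes the script itself is saved as UTF-8.

```shell
# byte_count is a hypothetical helper: it prints the number of bytes
# in its first argument (tr -d ' ' strips the padding some wc versions add).
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'é'   # 2 (the UTF-8 bytes 0xC3 0xA9)
byte_count 'e'   # 1
```

Any character that reports more than one byte here will be treated by a byte-oriented tr as several separate set members.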
You might prefer sed for handling non-ASCII characters.
$ echo 'é' | sed 's/é/e/g' | od -c
0000000 e \n
0000002
$ echo 'é' | sed 'y/é/e/' | od -c
0000000 e \n
0000002
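If several accented letters need replacing, the same sed approach extends naturally. The following is a sketch rather than a complete solution: strip_accents is a hypothetical helper name, the list of letters is only an illustration, and the script is assumed to be saved as UTF-8.

```shell
# strip_accents is a hypothetical helper, not a standard tool.
# Each s///g replaces the whole encoded sequence of one accented letter,
# so the result does not depend on sed splitting the letter into bytes.
strip_accents() {
    sed -e 's/é/e/g' -e 's/è/e/g' -e 's/à/a/g' -e 's/ç/c/g'
}

echo 'déjà vu, garçon' | strip_accents   # deja vu, garcon
```

Using s///g rather than y here is a deliberate choice: y requires both sets to have the same number of characters, which depends on sed interpreting the input in a UTF-8 locale.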
CodePudding user response:
'é' is two bytes wide in UTF-8, so tr treats it as two characters and produces 'ee'.
$ echo 'é' | tr 'é' 'e'
ee
One workaround is to map the two bytes of 'é' to two characters: a literal 'e' followed by a backspace ('\b'), so that no second 'e' is shown.
$ echo 'é' | od -c
0000000 303 251  \n
0000003
$ echo é | tr 'é' 'e\b' |od -c
0000000 e \b \n
0000003
EDIT: This approach is flawed. echo é | tr 'é' 'e\b' looks like it works but may produce odd results. Although you see an 'e' on your terminal, the output is actually 'e\b': an 'e' followed by a non-printing backspace character. You are better off using sed: sed 'y|é|e|'