Why does tr fail with é?

Time:03-19

When I run echo é | tr é e I get ee, not the e I was expecting.

Here's the output of the locale command:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

CodePudding user response:

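Looks like tr (at least the GNU coreutils version) doesn't handle multibyte characters well: it works on bytes, and in UTF-8 é is encoded as two bytes (octal 303 251).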

$ echo 'é' | od -c
0000000 303 251  \n
0000003
$ echo 'é' | tr 'é' e | od -c
0000000   e   e  \n
0000003

Since the left-hand set is 2 bytes long and the right-hand set has only one character, tr repeats the last character of the right-hand set until both sets are the same length.

$ echo 123456789 | tr 2468 xy
1x3y5y7y9
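The same padding can be seen at the byte level by spelling é out with octal escapes. This is only a sketch: it assumes a tr that pads the right-hand set the way GNU tr does, and LC_ALL=C is set here purely to force byte-wise behavior on implementations that are multibyte-aware.

```shell
# In UTF-8, é is the two bytes 0xC3 0xA9 (octal 303 251).
# A one-character right-hand set is padded to 'ee',
# so each of the two bytes is mapped to its own 'e'.
padded=$(printf '\303\251\n' | LC_ALL=C tr '\303\251' 'e')

# Writing the padded set out by hand gives the same result.
explicit=$(printf '\303\251\n' | LC_ALL=C tr '\303\251' 'ee')

printf '%s\n%s\n' "$padded" "$explicit"
```

Both lines print ee, which matches what the question observed.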

You might prefer sed for handling non-ASCII characters.

$ echo 'é' | sed 's/é/e/g' | od -c
0000000   e  \n
0000002
$ echo 'é' | sed 'y/é/e/' | od -c
0000000   e  \n
0000002
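If several accented letters need replacing, the s form can be chained. A minimal sketch follows; strip_accents is a hypothetical helper name, the list of letters is illustrative rather than complete, and it assumes the input uses precomposed characters (é as a single code point, not e plus a combining accent).

```shell
# strip_accents is a hypothetical helper, not a standard tool.
# Each s command replaces one multibyte sequence with one ASCII letter.
strip_accents() {
  sed -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' -e 's/à/a/g' -e 's/ç/c/g'
}

printf 'café crème, déjà reçu\n' | strip_accents
```

This prints cafe creme, deja recu. Separate s commands are used instead of a bracket expression such as [éèê] because each s pattern is matched as a literal byte sequence, so it does not depend on sed running in a UTF-8 locale.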

CodePudding user response:

'é' is two bytes wide in UTF-8, so tr produces 'ee'.

$ echo 'é' | tr 'é' 'e'
ee

Transliterate 'é' into two characters: a literal 'e' followed by '\b', a backspace that takes the place of the second 'e'.

$ echo 'é' | od -c
0000000 303 251  \n
0000003

$ echo 'é' | tr 'é' 'e\b' | od -c
0000000   e  \b  \n
0000003

EDIT: This approach is flawed. echo é | tr 'é' 'e\b' looks like it works but may produce odd results. Although you see an 'e' on your terminal, what the terminal actually receives is 'e\b': an 'e' followed by a non-printing backspace character. You are better off using sed: sed 'y|é|e|'
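The stray backspace shows up as soon as the result is compared or measured rather than just displayed. A small sketch, with LC_ALL=C used only to force the byte-wise tr behavior described above (the variable names are illustrative):

```shell
# Byte-oriented tr turns the two bytes of é into 'e' plus a backspace byte.
with_tr=$(printf 'é' | LC_ALL=C tr 'é' 'e\b')

# sed's s command replaces the whole two-byte sequence with a single 'e'.
with_sed=$(printf 'é' | sed 's/é/e/g')

# Both values contain only ASCII, so the length equals the byte count.
printf 'tr:  %s byte(s)\n' "${#with_tr}"
printf 'sed: %s byte(s)\n' "${#with_sed}"

if [ "$with_tr" = "e" ]; then echo "tr result equals e"; else echo "tr result differs from e"; fi
```

The tr result is 2 bytes long and fails the comparison with e, while the sed result is exactly e.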

Tags: bash