How to convert the DOUBLE LOW-9 QUOTATION MARK („) using a sed command

Time:09-17

I am getting a special character, '„', in our source files. We are using the sed command below to replace the '„' character with '&', but the operation isn't successful.

cat File.txt | sed 's/\x2\xE1/&/g' > File_New.txt

The hex code for „ is U+201E (&#x201E;).

CodePudding user response:

I think perl is far superior to sed at working with Unicode text (assuming here that your file is encoded in UTF-8):

$ cat input.txt
foo „ bar
$ perl -CSD -pe 's/\N{U+201E}/&/g' input.txt
foo & bar

(-CSD tells perl that standard input/output/error and all opened files are using UTF-8)

But (with the appropriate locale) you can use sed and a shell like bash that implements ANSI-C quoting to generate the character:

$ sed 's/'$'\u201E''/\&/g' input.txt
foo & bar

Or, just including the code point's UTF-8 bytes directly, instead of using an escape sequence, will typically work too:

$ sed 's/„/\&/g' input.txt
foo & bar

Some versions of sed, like the GNU one, support \xHH to represent a byte with the given hexadecimal value, but the Unicode code point U+201E is not encoded with the bytes used in the question's command; in UTF-8 it is the three-byte sequence E2 80 9E:

$ sed 's/\xE2\x80\x9E/\&/g' input.txt
foo & bar
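
As a rough sketch of where E2 80 9E comes from: UTF-8 stores code points in the range U+0800 to U+FFFF as three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx, with the code point's 16 bits filling the x positions. The helper name utf8_three_bytes below is hypothetical and only handles that three-byte range.

```shell
# Hypothetical helper: print the UTF-8 bytes for a code point given as hex
# digits. Only valid for U+0800..U+FFFF, the three-byte range.
utf8_three_bytes() {
  cp=$((0x$1))
  printf '%02X %02X %02X\n' \
    $(( 0xE0 | (cp >> 12) )) \
    $(( 0x80 | ((cp >> 6) & 0x3F) )) \
    $(( 0x80 | (cp & 0x3F) ))
}

utf8_three_bytes 201E   # prints: E2 80 9E
```

The top 4 bits of 0x201E (0x2) land in the first byte, the middle 6 bits (all zero) in the second, and the low 6 bits (0x1E) in the third.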

All the sed examples escape the & in the replacement because without the backslash before it, & is replaced by the matched text, leaving you right back where you started from.
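
A small side-by-side sketch of that difference, assuming a UTF-8 encoded script and input (the variable names are only for illustration):

```shell
# Unescaped &: sed substitutes the matched text, so „ is replaced by itself.
unescaped=$(printf 'foo „ bar\n' | sed 's/„/&/g')

# Escaped \&: sed inserts a literal ampersand.
escaped=$(printf 'foo „ bar\n' | sed 's/„/\&/g')

printf '%s\n' "$unescaped" "$escaped"
```

The first line printed is unchanged from the input, while the second contains the ampersand.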

CodePudding user response:

With a recent GNU sed:

$ printf '\u201E\n' | hd
00000000  e2 80 9e 0a                                       |....|
00000004

$ printf 'a\u201Eb\n' | LC_ALL=C sed 's/\xe2\x80\x9e/\&/g'
a&b

Documentation: info sed -n 'Locale Considerations'

Since & is a metacharacter in the replacement section of sed's s command, it must be escaped.
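
Putting this together for the file-to-file workflow in the question, a minimal sketch might look like the following. It assumes GNU sed (for the \xHH escapes) and UTF-8 input; the function name replace_low9_quote is hypothetical, and the temporary directory is used only so the demo is self-contained.

```shell
# Hypothetical helper: replace every U+201E in file $1 with a literal &,
# writing the result to file $2. LC_ALL=C makes sed match raw bytes.
replace_low9_quote() {
  LC_ALL=C sed 's/\xE2\x80\x9E/\&/g' "$1" > "$2"
}

# Demo on a throwaway copy; \342\200\236 is E2 80 9E written in octal.
tmp=$(mktemp -d)
printf 'foo \342\200\236 bar\n' > "$tmp/File.txt"
replace_low9_quote "$tmp/File.txt" "$tmp/File_New.txt"
cat "$tmp/File_New.txt"
rm -r "$tmp"
```

Reading the input file directly also avoids the extra cat process used in the original command.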
