option for \u instead of Unicode replacement

Time:11-29

If I run this Go code:

package main

import (
   "encoding/json"
   "os"
)

func main() {
   json.NewEncoder(os.Stdout).Encode("\xa1") // "\ufffd"
}

I lose data, since once the Unicode replacement is done, I can no longer get back the original value. Compare with this Python code:

import json

a = '\xa1'
b = json.dumps(a) # "\u00a1"
print(json.loads(b) == a) # True

Here no replacement is done, so no data is lost, and the resulting JSON is still valid. Does Go have a way to encode a JSON string using escaping instead of replacement?

CodePudding user response:

This example is a false equivalence. In Python, '\xa1' is a valid Unicode string; it is just one possible spelling of the same value, like '\u00a1' or '\U000000a1' or chr(0xa1) or '\N{INVERTED EXCLAMATION MARK}' or '¡' or ...

The equivalent in Python code would be:

>>> print(json.dumps(b'\xa1'.decode(errors='replace')))
"\ufffd"

This also prints an ASCII representation of the substituted REPLACEMENT CHARACTER on stdout, the same as in Go.

CodePudding user response:

This is because "\xa1" is not a valid Unicode string in Go. It contains the single byte 0xa1, which is not valid UTF-8 by itself. The invalid byte gets replaced with U+FFFD, the "replacement character", which is used when the input is invalid.

If you want to encode the Unicode character U+00A1, write it as "\u00a1". If you want arbitrary data to round-trip through JSON, you will have to represent it a different way (by base64-encoding it, for example).

Python just works differently: in Python, the \xa1 escape sequence is U+00A1. Again, in Go, \xa1 is the byte 0xa1, which is not a valid Unicode string by itself and cannot be encoded as a JSON string.
