When I use json.dumps in Python 3.8, special characters are "escaped", like:
>>> import json
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'
What kind of encoding is this? Is this an "escape encoding"? And why is this kind of encoding not part of the encodings module? (Also see codecs; I think I tried all of them.)
To put it another way, how can I convert the string 'Crêpes' to the string 'Cr\\u00eapes' using Python encodings, escaping, etc.?
CodePudding user response:
You are probably confused by the fact that this is a JSON string, not directly a Python string.
Python would encode this string as "Cr\u00eapes", where \u00ea represents a single Unicode character by its hexadecimal code point. In other words, in Python, len("\u00ea") == 1.
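For example:
>>> "Cr\u00eapes"
'Crêpes'
>>> len("\u00ea")
1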
JSON requires the same sort of encoding, but embedding the JSON-encoded value in a Python string literal requires you to double the backslash; so in Python's representation this becomes "Cr\\u00eapes", where you have a literal backslash, a literal u, two literal zeros, and the literal characters e and a. Thus, len("\\u00ea") == 6.
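You can verify this in the interpreter:
>>> len("\\u00ea")
6
>>> list("\\u00ea")
['\\', 'u', '0', '0', 'e', 'a']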
If you have JSON in a file, the simplest way to load it into Python is to pass the file object to json.load() (or a string to json.loads()), which reads and decodes it into a native Python data structure.
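For example, round-tripping the value from the question:
>>> import json
>>> json.loads('{"Cr\\u00eapes": 5}')
{'Crêpes': 5}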
If you need to decode the hexadecimal sequence separately, the unicode-escape codec does that on a bytes value:
>>> b"Cr\\u00eapes".decode('unicode-escape')
'Crêpes'
This is sort of coincidental, and works simply because the JSON representation happens to be identical to the Python unicode-escape representation. You still need a b'...' aka bytes input for that. ("Crêpes".encode('unicode-escape') produces a slightly different representation; "Cr\\u00eapes".encode('us-ascii') produces a bytes string with the Unicode representation b"Cr\\u00eapes".)
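To make the difference concrete:
>>> "Crêpes".encode('unicode-escape')
b'Cr\\xeapes'
>>> "Cr\\u00eapes".encode('us-ascii')
b'Cr\\u00eapes'
Note that unicode-escape uses the shorter \xea form for code points below 256, which Python accepts but JSON does not.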
CodePudding user response:
It is not a Python encoding. It is the way JSON encodes Unicode non-ASCII characters. It is independent of Python and is used in exactly the same way in, for example, Java or a C or C++ library.
The rule is that a non-ASCII character in the Basic Multilingual Plane (i.e. with a code point that fits in 16 bits) is encoded as \uxxxx, where xxxx is the hexadecimal Unicode code point. (Characters outside the BMP are encoded as two \uxxxx escapes forming a UTF-16 surrogate pair.)
This explains why the ê is written as \u00ea: its Unicode code point is U+00EA.
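You can check both cases from Python:
>>> import json
>>> hex(ord('ê'))
'0xea'
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'
>>> json.dumps('😀')
'"\\ud83d\\ude00"'
The last value is U+1F600, which lies outside the BMP, so it comes out as a surrogate pair.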