What encoding is used by json.dumps?

Time:12-24

When I use json.dumps in Python 3.8, special characters are "escaped", like this:

>>> import json
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'

What kind of encoding is this? Is this an "escape encoding"? And why is this kind of encoding not part of the encodings module? (Also see codecs, I think I tried all of them.)

To put it another way, how can I convert the string 'Crêpes' to the string 'Cr\\u00eapes' using Python encodings, escaping, etc.?

CodePudding user response:

You are probably confused by the fact that this is a JSON string, not directly a Python string.

Python would write this string as "Cr\u00eapes", where \u00ea represents a single Unicode character by its hexadecimal code point. In other words, in Python, len("\u00ea") == 1.

JSON uses the same sort of escape, but embedding the JSON-encoded value in a Python string requires you to double the backslash; so in Python's representation, this becomes "Cr\\u00eapes", where you have a literal backslash, a literal u, two literal zeros, a literal e character, and a literal a character. Thus, len("\\u00ea") == 6.
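The two lengths can be checked directly. This is only a small sketch; the variable names are made up for illustration.

```python
import json

# One character: the Python parser interprets the escape in the literal.
single_char = "\u00ea"
# Six characters: backslash, u, 0, 0, e, a.
six_chars = "\\u00ea"

encoded = json.dumps({"Crêpes": 5})

print(len(single_char))  # 1
print(len(six_chars))    # 6
print(encoded)           # print() shows the JSON text itself, with one backslash
print(repr(encoded))     # repr() doubles the backslash, as the REPL does
```

So the doubled backslash in the question comes from how the REPL displays the string, not from the JSON text.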

If you have JSON in a file, the simplest way to load it into Python is to use json.load() on the open file (or json.loads() on a string you have already read) to decode it into a native Python data structure.
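A minimal sketch of the decoding step, assuming the JSON text from the question. io.StringIO stands in for a real open file here, so no file name is assumed.

```python
import io
import json

# The JSON text; in Python source the backslash has to be doubled.
json_text = '{"Cr\\u00eapes": 5}'

# json.loads() parses a string that holds JSON.
from_string = json.loads(json_text)

# json.load() parses from a file-like object.
from_file = json.load(io.StringIO(json_text))

print(from_string)  # {'Crêpes': 5}
```

Either way, the \u00ea escape is turned back into the single character ê.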

If you need to decode the hexadecimal sequence separately, the unicode-escape codec does that on a bytes value:

>>> b"Cr\\u00eapes".decode('unicode-escape')
'Crêpes'

This is somewhat coincidental: it works because the JSON escape for this character happens to be identical to Python's unicode-escape representation. You still need a bytes (b'...') input for that. ("Crêpes".encode('unicode-escape') produces a slightly different representation, b'Cr\\xeapes'. "Cr\\u00eapes".encode('us-ascii') produces the bytes object b'Cr\\u00eapes', which still contains the escape sequence.)
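To answer the "how can I convert" part of the question, one possible approach is to let json.dumps() do the escaping on the bare string and drop the surrounding quotes. This is a sketch for this particular string, not a general-purpose converter.

```python
import json

text = "Crêpes"

# Python's own escape codec picks the shorter \xea form for this character.
python_escaped = text.encode("unicode-escape")

# json.dumps() on a bare string gives the JSON form; the slice drops the quotes.
json_escaped = json.dumps(text)[1:-1]

# Round trip: ASCII-encode the JSON form, then decode the escapes.
round_trip = json_escaped.encode("ascii").decode("unicode-escape")

print(python_escaped)  # b'Cr\\xeapes'
print(json_escaped)    # Cr\u00eapes
print(round_trip)      # Crêpes
```

The round trip through unicode-escape is only shown for this one string; for real JSON data, json.loads() remains the safer choice.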

CodePudding user response:

It is not a Python encoding. It is the way JSON escapes non-ASCII Unicode characters. It is independent of Python and is used in exactly the same way in, for example, Java or a C or C++ library.

The rule is that a non-ASCII character in the Basic Multilingual Plane (i.e. one whose code point fits in 16 bits) is encoded as \uxxxx, where xxxx is the Unicode code point written as four hexadecimal digits.

This explains why the ê is written as \u00ea: its Unicode code point is U+00EA.
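The rule can be written out by hand to see that it matches what json.dumps() produces. The escape_bmp helper below is hypothetical and only a sketch: it leaves ASCII alone, ignores characters beyond the BMP, and does not handle quotes or backslashes the way a real JSON encoder must.

```python
import json


def escape_bmp(text: str) -> str:
    """Hypothetical helper: write each non-ASCII BMP character as \\uXXXX."""
    parts = []
    for ch in text:
        code = ord(ch)
        if 0x7F < code <= 0xFFFF:
            # Four lowercase hexadecimal digits, zero-padded.
            parts.append(f"\\u{code:04x}")
        else:
            # ASCII is kept as-is; anything beyond the BMP is out of scope here.
            parts.append(ch)
    return "".join(parts)


escaped = escape_bmp("Crêpes")
print(escaped)                                   # Cr\u00eapes
print(json.dumps("Crêpes"))                      # "Cr\u00eapes"
print(json.dumps("Crêpes", ensure_ascii=False))  # "Crêpes"
```

The last line shows that the escaping is optional in Python's encoder: passing ensure_ascii=False to json.dumps() keeps the ê as it is.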
