UTF-8 characters in Python string even after decoding from UTF-8?

Time:09-19

I'm working on converting portions of XHTML to JSON objects. I finally got everything in JSON form, but some UTF-8 character codes are being printed. Example:

{
  "p": {
    "@class": "para-p",
    "#text": "I\u2019m not on Earth."
  }
}

This should be:

{
  "p": {
    "@class": "para-p",
    "#text": "I'm not on Earth."
  }
}

This is just one example of UTF-8 codes coming through. How can I go through the string and replace every instance of a UTF-8 code with the character it represents?

CodePudding user response:

\u2019 is not a UTF-8 character but a Unicode escape sequence. It is valid JSON, and when read back via json.load it becomes ’ (RIGHT SINGLE QUOTATION MARK).
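
As a quick check of that claim, here is a minimal sketch that parses the escaped form with json.loads (the string-based counterpart of json.load). The variable names are made up for illustration, and the JSON text is trimmed down to the one field from the question.

```python
import json

# The escaped form as it would appear in the JSON file.
# A raw string keeps the backslash, so Python itself does not decode \u2019.
encoded = r'{"#text": "I\u2019m not on Earth."}'

# The JSON parser decodes the escape into the real character.
decoded = json.loads(encoded)
text = decoded["#text"]

print(text)                  # I’m not on Earth.
print(text[1] == "\u2019")   # True
```

So the escape only exists in the serialized text. Once the JSON is parsed, the Python string already contains the actual character.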

If you want to write the actual character, use ensure_ascii=False to prevent escape codes from being written for non-ASCII characters:

import json

with open('output.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
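
To see the effect of the flag without writing a file, the following sketch serializes the same data both ways with json.dumps. The dict literal is an assumption reconstructed from the question's example, and the variable names are illustrative only.

```python
import json

# Assumed stand-in for the converted XHTML data from the question.
data = {"p": {"@class": "para-p", "#text": "I\u2019m not on Earth."}}

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(data)

# ensure_ascii=False: non-ASCII characters are written as-is.
literal = json.dumps(data, ensure_ascii=False)

print(escaped)
print(literal)

# Both forms are valid JSON and parse back to the same data.
assert json.loads(escaped) == json.loads(literal) == data
```

Either output is correct JSON. The choice only affects how the file looks to a human reader or to a tool that does not parse JSON escapes.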

CodePudding user response:

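import json

# x holds the literal six characters \u2019, not the character itself
x = "I\\u2019m not on Earth."
print(x)
# Wrap the text in double quotes so json.loads parses it as a JSON string
# and decodes the escape (assumes x contains no unescaped double quotes)
x_json = json.loads(f'"{x}"')
print(x_json)

Output:

I\u2019m not on Earth.
I’m not on Earth.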