Trouble converting an XML file with Spanish text to JSON in Python

Time:09-07

I'm having some issues when converting an XML file to JSON. This is what the XML file looks like:

<return>
  <ciudad>BARRANQUILLA</ciudad>
  <codProducto>5</codProducto>
  <enviado>0</enviado>
  <fechaCaptura>2020-03-18T00:00:00-05:00</fechaCaptura>
  <fechaCreacion>2020-03-18T14:00:01-05:00</fechaCreacion>
  <precioPromedio>811</precioPromedio>
  <producto>Ch&#195;&#179;colo mazorca</producto>
  <regId>316992</regId>
</return>
<return>
  <ciudad>BARRANQUILLA</ciudad>
  <codProducto>8</codProducto>
  <enviado>0</enviado>
  <fechaCaptura>2020-03-18T00:00:00-05:00</fechaCaptura>
  <fechaCreacion>2020-03-18T14:00:01-05:00</fechaCreacion>
  <precioPromedio>2063</precioPromedio>
  <producto>Piment&#195;&#179;n</producto>
  <regId>316995</regId>
</return>

This is the code I'm using to convert the file, using the xmltodict library:

import json
import xmltodict

with open('result.xml', 'r', encoding='iso-8859-1') as xmlarch:
    with open('result.json', 'w', encoding='iso-8859-1') as json_f:
        obj = xmltodict.parse(xmlarch.read())
        json.dump(obj, json_f, indent=4)

But some characters end up encoded in the JSON file after the conversion:

[
  {
    "ciudad": "BARRANQUILLA",
    "codProducto": "5",
    "enviado": "0",
    "fechaCaptura": "2020-03-18T00:00:00-05:00",
    "fechaCreacion": "2020-03-18T14:00:01-05:00",
    "precioPromedio": "811",
    "producto": "Chócolo mazorca",
    "regId": "316992"
  },
  {
    "ciudad": "BARRANQUILLA",
    "codProducto": "8",
    "enviado": "0",
    "fechaCaptura": "2020-03-18T00:00:00-05:00",
    "fechaCreacion": "2020-03-18T14:00:01-05:00",
    "precioPromedio": "2063",
    "producto": "Pimentón",
    "regId": "316995"
  }
]

I have never worked with a file that contains Spanish characters. I think the problem may be the encoding; I already tried other encodings such as utf-8, but that did not work. Any feedback is appreciated!

CodePudding user response:

You need to pass ensure_ascii=False to json.dump.

From the json documentation:

this module’s serializer sets ensure_ascii=True by default, thus escaping the output so that the resulting strings only contain ASCII characters.
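
A minimal sketch of what that default does, using json.dumps on a one-field record modelled on the question's data (the variable names here are only for illustration):

```python
import json

record = {"producto": "Pimentón"}

# Default (ensure_ascii=True): every non-ASCII character is written
# as a \uXXXX escape, so the output is pure ASCII.
escaped = json.dumps(record)

# ensure_ascii=False: characters are written as they are.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"producto": "Piment\u00f3n"}
print(readable)  # {"producto": "Pimentón"}
```

Both strings are valid JSON and load back to the same value; the option only changes how the text looks in the file.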

So the code becomes:

import json
import xmltodict

with open('result.xml', 'r', encoding='iso-8859-1') as xmlarch:
    with open('result.json', 'w', encoding='iso-8859-1') as json_f:
        obj = xmltodict.parse(xmlarch.read())
        # Write non-ASCII characters as-is instead of \uXXXX escapes
        json.dump(obj, json_f, indent=4, ensure_ascii=False)
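
One more thing that may be worth checking: in the XML above, the accented letter is written as two character references, &#195;&#179;. Those are the code points Ã and ³, which is what the two UTF-8 bytes of ó look like when read as Latin-1. If the parsed text comes out as "PimentÃ³n", a possible repair is to re-encode it as Latin-1 and decode it as UTF-8. The sketch below assumes that this is how the source data was damaged; fix_mojibake is a hypothetical helper name, and the standard library's ElementTree is used only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Element copied from the question's XML.
element = ET.fromstring("<producto>Piment&#195;&#179;n</producto>")
parsed = element.text  # 'PimentÃ³n': two code points, U+00C3 and U+00B3


def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were read as Latin-1.

    Returns the text unchanged if it does not fit that pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake(parsed))          # Pimentón
print(fix_mojibake("BARRANQUILLA"))  # BARRANQUILLA (plain ASCII is untouched)
```

Applied to each string value before json.dump, this would give the expected Spanish text, but only if the assumption about the source data holds.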