How to handle the encoding of the XML parser?

Time:10-26

I tried to specify an encoding when reading the XML file so that the parser could handle the invalid content, but it did not work.

This is my code:

import xml.etree.ElementTree as ET
import io

file_path = r'c:\data\MSM\Energy\XML-files\my_xml.xml' 

with io.open(file_path, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    tree = ET.fromstring(contents)

This is what I receive:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 62, column 48

This is what line 62 of the XML file looks like:

62    <Organisation>Blue & Logistics B.V.</Organisation>

I'm sure it has to do with the & sign, so how can I encode that?

CodePudding user response:

Load the XML as text, replace the & with a character reference, and then pass the result to the XML parser:

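import xml.etree.ElementTree as ET

# Read the file as plain text so the raw "&" can be replaced before parsing.
with open('x.xml') as f:
    xml = f.read()

# Caution: this also rewrites the "&" of entities that are already escaped (e.g. &amp;).
xml = xml.replace("&", "&#38;")
root = ET.fromstring(xml)
print(root)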

x.xml

<r>
  <Organisation>Blue & Logistics B.V.</Organisation>
</r>

output

<Element 'r' at 0x7f431e86bc70>
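
One caveat with a blanket replace: if the file already contains correctly escaped entities such as &amp;, they get escaped a second time. A minimal sketch of a more careful variant follows, assuming the only problem in the input is bare ampersands. The name escape_bare_ampersands and the regular expression are illustrative, not part of any library, and the pattern does not account for CDATA sections or for entity names the parser does not know (for example &nbsp;).

```python
import re
import xml.etree.ElementTree as ET

# Matches an "&" that does NOT start an entity or character reference
# such as &amp;, &#38; or &#x26;. (Hypothetical helper pattern.)
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Escape only ampersands that are not already part of a reference."""
    return BARE_AMP.sub("&amp;", text)


sample = (
    "<r>"
    "<Organisation>Blue & Logistics B.V.</Organisation>"
    "<Other>A &amp; B</Other>"
    "</r>"
)

root = ET.fromstring(escape_bare_ampersands(sample))
print(root.find("Organisation").text)  # Blue & Logistics B.V.
print(root.find("Other").text)         # A & B
```

This is still a repair of broken input, so it is only a stopgap until the source of the file is fixed.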

CodePudding user response:

First, it has nothing to do with encoding. It's simply that your file doesn't contain well-formed XML. Find out how, where, and when it was created, and fix the process that created it. An & in content needs to be escaped, typically as &amp;.

Don't try repairing bad XML except in desperation - you're very likely to make things worse, especially if you have to handle multiple input documents from the same unreliable source.
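
If you control the code that writes the file, here is a minimal sketch of how the escaping could be done at the source. It assumes the producer is written in Python, which the question does not state; the variable names are illustrative. xml.sax.saxutils.escape handles text that is concatenated by hand, and ElementTree escapes element text on its own when it serialises.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

name = "Blue & Logistics B.V."

# Option 1: escape the text explicitly when the XML is assembled by hand.
manual = "<Organisation>" + escape(name) + "</Organisation>"

# Option 2: build the element with ElementTree, which escapes on serialisation.
elem = ET.Element("Organisation")
elem.text = name
serialized = ET.tostring(elem, encoding="unicode")

print(manual)      # <Organisation>Blue &amp; Logistics B.V.</Organisation>
print(serialized)  # <Organisation>Blue &amp; Logistics B.V.</Organisation>
```

Either way the parser on the reading side then sees well-formed XML and returns the original text, including the & sign.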
