I tried to specify an encoding when reading the XML file so that the invalid content would be read without problems, but it did not work.
This is my code:
import xml.etree.ElementTree as ET
import io
file_path = r'c:\data\MSM\Energy\XML-files\my_xml.xml'
with io.open(file_path, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
tree = ET.fromstring(contents)
This is what I receive:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 62, column 48
This is what line 62 of the XML file looks like:
62 <Organisation>Blue & Logistics B.V.</Organisation>
I'm sure it has to do with the & sign, so how can I encode that?
CodePudding user response:
Load the XML as text, replace each & with &amp;, and then use the XML parser:
import xml.etree.ElementTree as ET
with open('x.xml') as f:
    xml = f.read()
# escape the bare ampersand so the text becomes well-formed XML
xml = xml.replace("&", "&amp;")
root = ET.fromstring(xml)
print(root)
x.xml
<r>
<Organisation>Blue & Logistics B.V.</Organisation>
</r>
output
<Element 'r' at 0x7f431e86bc70>
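A plain replace of every & would also rewrite ampersands that already belong to an entity such as &amp;, turning them into &amp;amp;. If the file might mix escaped and unescaped ampersands, a regular expression that only touches the bare ones is a bit safer. This is only a sketch: the helper name escape_bare_ampersands and the sample string are made up for illustration, and it does not handle CDATA sections or undefined entity names such as &nbsp;.

```python
import re
import xml.etree.ElementTree as ET

# Match an ampersand that does NOT already start an entity reference
# (&name;) or a character reference (&#123; or &#x1F;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Hypothetical helper: escape only ampersands that are not part of a reference."""
    return BARE_AMP.sub("&amp;", text)

# Sample input with one bare ampersand and one that is already escaped.
xml_text = (
    "<r>"
    "<Organisation>Blue & Logistics B.V.</Organisation>"
    "<Other>A &amp; B</Other>"
    "</r>"
)

root = ET.fromstring(escape_bare_ampersands(xml_text))
print(root.find("Organisation").text)
print(root.find("Other").text)
```

The already escaped &amp; in the second element is left alone, so both elements parse to text containing a single & character.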
CodePudding user response:
First, it has nothing to do with encoding. It's simply that your file doesn't contain well-formed XML. Find out how, where, and when it was created, and fix the process that created it. An & in content needs to be escaped, typically as &amp;.
Don't try repairing bad XML except in desperation - you're very likely to make things worse, especially if you have to handle multiple input documents from the same unreliable source.
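To illustrate what fixing the producing process can look like: if the file is generated by a program, building it with an XML library instead of string concatenation gets the escaping done automatically. The sketch below assumes the producer is written in Python, which the question does not say; the element names are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Build the document through the API instead of pasting strings together.
root = ET.Element("r")
org = ET.SubElement(root, "Organisation")
org.text = "Blue & Logistics B.V."  # plain text, no manual escaping needed

# The serializer escapes the ampersand when writing the text out.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
# <r><Organisation>Blue &amp; Logistics B.V.</Organisation></r>

# Reading it back gives the original text again.
print(ET.fromstring(xml_text).find("Organisation").text)
```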