How to keep HTML tags when writing an ElementTree tree to disk?

Time:10-26

I'm trying to write an XML tree to disk using Python's xml.etree.ElementTree to reproduce an example document given to me. The target XML document has fields in it that look like:

<title>
This is a test of <br/> Hershey's <sup>&#174;</sup> chocolate factory machine <br/>
</title>

My problem is that I can't produce this output with ElementTree's .write() method. Either the HTML tags are escaped to &lt;br/&gt;, or the registered-trademark sign (the &#174; character reference) is written as the literal ® symbol. Is there a way to encode my text so that the symbol stays a &#174; reference while the HTML tags stay as tags? I've tried different encoding options in the write method, but none of them does the trick.

Edit: Here is a minimal working example. Take an input XML template file like:

<?xml version='1.0' encoding='UTF-8'?>
<document>
        <title> Text to replace </title>
</document>

and we try to modify the text like so:

import xml.etree.ElementTree as ET

tree = ET.parse('example.xml')
root = tree.getroot()
to_sub_text = "This is a test of <br/> Hershey's <sup>&#174;</sup> chocolate factory machine"
spot = root.find('title')
spot.text = to_sub_text
tree.write('example_mod.xml', encoding='UTF-8', xml_declaration=True)

this writes the following file:

<?xml version='1.0' encoding='UTF-8'?>
<document>
        <title>This is a test of &lt;br/&gt; Hershey's &lt;sup&gt;&amp;#174;&lt;/sup&gt; chocolate factory machine</title>
</document>

As I said, the document I'm trying to replicate leaves those html tags as tags. My questions are:

  1. Can I modify my code to do that?
  2. Is doing this good practice, or would it have been better to leave the output as it currently is (in which case I need to talk to the team that asked me to provide it this way)?

Answer:

The spot.text = to_sub_text assignment does not do what you want. An element's text property holds plain text only, so any markup characters in the string are escaped when the tree is written. It is not possible to use it to add both text and subelements.
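To make the text-versus-subelement distinction concrete, here is a minimal sketch of how the mixed content in the question could be built by hand. Text before the first child goes in the parent's text, and text that follows a child goes in that child's tail. The variable names are illustrative only, and the snippet serializes to a string rather than a file so the result is easy to inspect.

```python
import xml.etree.ElementTree as ET

# Build <title> with mixed content: plain text interleaved with child elements.
title = ET.Element('title')
title.text = "This is a test of "          # text before the first child

br = ET.SubElement(title, 'br')
br.tail = " Hershey's "                    # text that follows <br/>

sup = ET.SubElement(title, 'sup')
sup.text = "\u00ae"                        # the registered sign, U+00AE
sup.tail = " chocolate factory machine "   # text that follows </sup>

ET.SubElement(title, 'br')                 # trailing <br/>, no tail text

# With an ASCII encoding, the non-ASCII character is written
# as a numeric character reference.
result = ET.tostring(title, encoding='US-ASCII').decode('ascii')
print(result)
```

Here the tags survive as tags because they are real elements in the tree, not characters inside a text string.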

What you can do is parse the markup into a new <title> element object and append that to the root:

import xml.etree.ElementTree as ET

tree = ET.parse('example.xml')
root = tree.getroot()

# Remove the old title element
old_title = root.find('title')
root.remove(old_title)

# Add a new title, parsed from a string so the tags become real elements
new_title = "<title>This is a test of <br/> Hershey's <sup>&#174;</sup> chocolate factory machine</title>"
root.append(ET.fromstring(new_title))

# Prettify output (requires Python 3.9 or later)
ET.indent(tree)

# Use encoding='US-ASCII' to force output of character references for non-ASCII characters
tree.write('example_mod.xml', encoding='US-ASCII', xml_declaration=True)

Output in example_mod.xml:

<?xml version='1.0' encoding='US-ASCII'?>
<document>
  <title>This is a test of <br /> Hershey's <sup>&#174;</sup> chocolate factory machine</title>
</document>
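
One thing to note about remove() followed by append(): the new element lands at the end of the root, which changes the element order if the root has other children. Below is a minimal sketch of an in-place swap instead. The helper name replace_child is hypothetical (it is not part of ElementTree), and the extra <author> element is added purely for illustration.

```python
import xml.etree.ElementTree as ET

def replace_child(parent, tag, new_xml):
    """Swap the first direct child named `tag` for the element parsed from `new_xml`.

    Hypothetical helper for illustration; not an ElementTree API.
    """
    old = parent.find(tag)
    if old is None:
        raise ValueError(f"no <{tag}> child found")
    index = list(parent).index(old)   # position of the old element among its siblings
    new = ET.fromstring(new_xml)
    new.tail = old.tail               # keep the whitespace that followed the old element
    parent[index] = new               # assign by index so the position is preserved
    return new

# Illustrative input: the template from the question plus a second child element.
root = ET.fromstring(
    "<document><title> Text to replace </title><author>A</author></document>"
)
replace_child(
    root,
    'title',
    "<title>This is a test of <br/> Hershey's <sup>&#174;</sup> chocolate factory machine</title>",
)

output = ET.tostring(root, encoding='US-ASCII').decode('ascii')
print(output)
```

With this approach <title> stays ahead of <author>; for the single-child template in the question, remove() plus append() gives the same result.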