Fixing HTML Tags within XML files using Python

Time:12-21

I have been given .htm files that are structured as XML and have HTML tags within them. The issue is that a lot of the HTML entities in these files have been converted to literal characters along the way. For example, &lt; has been converted to <, &amp; has been converted to &, and so on. Is there a Python module that is able to fix these HTML entities, something like HTML Corrector?

For example:

<Employee>
  <name> Adam</name
  <age> > 24 </age>
  <Nicknames> A & B </Nicknames>
</Employee>

In the example above, the > in age would be converted to '&gt;' and the & would be converted to '&amp;'.

Desired Result:

<Employee>
  <name> Adam</name
  <age> &gt; 24 </age>
  <Nicknames> A &amp; B </Nicknames>
</Employee>

CodePudding user response:

If the HTML is well-formed, you can just convert to a BeautifulSoup object (from beautifulsoup4) and the inner text of each tag will be escaped:

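from bs4 import BeautifulSoup

my_html = \
"""<Employee>
<name> Adam</name>
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

# Name the parser explicitly: with no argument bs4 guesses one and warns,
# and the printed output can differ from parser to parser.
soup = BeautifulSoup(my_html, "html.parser")
print(soup)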

Outputs:

<employee>
<name> Adam</name>
<age> &gt; 24 </age>
<nicknames> A &amp; B </nicknames>
</employee>
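
Note that the tag names come back lowercased in the output above, which may matter if the files are consumed as case-sensitive XML. As a rough alternative, the sketch below escapes only the text between tags using the standard library, so the tags themselves are left as they are. The names `escape_text_nodes`, `BARE_AMP` and `TEXT_BETWEEN_TAGS` are made up for this illustration, and it assumes that no stray < appears in the text and that no attribute value contains >.

```python
import re

# "&" that does not already start an entity such as &amp;, &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

# Text that sits between the ">" of one tag and the "<" of the next tag.
TEXT_BETWEEN_TAGS = re.compile(r">([^<]*)<")


def escape_text_nodes(markup: str) -> str:
    """Escape bare & and > inside element text, leaving the tags untouched."""
    def fix(match):
        text = BARE_AMP.sub("&amp;", match.group(1))
        text = text.replace(">", "&gt;")
        return ">" + text + "<"
    return TEXT_BETWEEN_TAGS.sub(fix, markup)


sample = """<Employee>
  <name> Adam</name>
  <age> > 24 </age>
  <Nicknames> A & B </Nicknames>
</Employee>"""

print(escape_text_nodes(sample))
```

Entities that are already present are skipped by the negative lookahead, so running the function a second time should leave the text unchanged.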

Not sure if this was intentional, but the exact example you provided includes a broken tag: </name without the closing >. You would need to fix this first, which is trickier; you could use a regular expression. This gets the correct output for your example:

import re
from bs4 import BeautifulSoup

my_html = \
"""<Employee>
<name> Adam</name
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

# Add the missing ">" to an end tag that is followed directly by whitespace.
my_html = re.sub(r"</([^>\s]*)(\s)", r"</\1>\2", my_html)
soup = BeautifulSoup(my_html, "html.parser")
print(soup)
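
The regular-expression step can also be tried on its own with only the standard library. This is a minimal sketch, assuming the only damage is an end tag that lost its closing > right before whitespace; `close_end_tags` and `UNCLOSED_END_TAG` are hypothetical names used for illustration.

```python
import re

# An end-tag name that is followed directly by whitespace instead of ">".
UNCLOSED_END_TAG = re.compile(r"</([A-Za-z_][\w.\-]*)(?=\s)")


def close_end_tags(markup: str) -> str:
    """Append the missing ">" to end tags such as "</name"."""
    return UNCLOSED_END_TAG.sub(r"</\1>", markup)


broken = "<name> Adam</name\n<age> > 24 </age>"
print(close_end_tags(broken))
```

A legal end tag written with a space before the bracket, such as `</name >`, would be altered incorrectly by this pattern, so it is only suitable when that form does not occur in the files.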