Parsing an XML file that contains HTML snippets, renaming HTML class names, and then write back the

Time:12-05

I've got XML files that contain HTML snippets. I'm trying to write a Python script that opens such an XML file, finds the elements containing the HTML, renames the classes, and then writes the modified XML back to a file. Here's an XML example:

<?xml version="1.0" encoding="UTF-8"?>
<question_categories>
  <question_category id="18883">
    <name>templates</name>
    <questions>
      <question id="1419226">
        <parent>0</parent>
        <name>_template_master</name>
        <questiontext>
          &lt;div class=&quot;wrapper&quot;&gt;
            &lt;div class=&quot;wrapper element&quot;&gt;
              &lt;span&gt;Exercise 1&lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        </questiontext>
      </question>
      <question id="1419238">
        <parent>0</parent>
        <name>_template_singleDropDown</name>
        <questiontext>
          &lt;div class=&quot;wrapper&quot;&gt;
            &lt;div class=&quot;element wrapper&quot;&gt;
              &lt;span&gt;Exercise 2&lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        </questiontext>
      </question>
    </questions> 
  </question_category>
</question_categories>

The element containing the HTML is <questiontext>, the HTML class to be renamed is wrapper, and the new class name should be prefixed-wrapper.

I managed to loop through the XML, extract the HTML, and rename the class, but I don't know how to put everything together so that I end up with an XML file containing the renamed class names. This is my code so far:

from bs4 import BeautifulSoup
with open('dummy_short.xml', 'r') as f:
    file = f.read() 

soup_xml = BeautifulSoup(file, 'xml')

for questiontext in soup_xml.find_all('questiontext'):    
    for singleclass in BeautifulSoup(questiontext.text, 'html.parser').find_all(class_='wrapper'):
        pos = singleclass.attrs['class'].index('wrapper')
        singleclass.attrs['class'][pos] = 'prefixed-wrapper'  

print(soup_xml)

Unfortunately, when printing soup_xml at the end, the contents are unchanged, i.e. the class names aren't renamed.

EDIT: Because the same class name can occur in very different and complex contexts (for example, alongside other classes), a static string match doesn't work. And rather than resorting to complicated, hard-to-read regexes, I want to use a parser like BeautifulSoup, since parsers are made exactly for this purpose.

CodePudding user response:

After your comment I changed my code a little. The HTML part is now correctly escaped, but the empty tags are gone. The XML is still valid. It seems tree.write() has some trouble with mixed XML and inserted HTML sequences.

import xml.etree.ElementTree as ET
from html import escape, unescape

tree = ET.parse('source.xml')
root = tree.getroot()

def replace_html(elem):
    dummyXML = ET.fromstring(elem)
    for htm in dummyXML.iter('div'):
        if htm.tag == "div" and htm.get('class') =="wrapper":
            htm.set('class', "prefixed-wrapper")       
    return ET.tostring(dummyXML, method='html').decode('utf-8')
    
for elem in root.iter("questiontext"):
    html = replace_html(unescape(elem.text))
    elem.text = escape(html)
   
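# Write the declaration and the serialized tree in one pass.
# replace() undoes the second round of escaping that ET.tostring() applies to
# the already-escaped text; note it would also alter a literal &amp; in the data.
with open('new.xml', 'w', encoding='utf-8') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write(ET.tostring(root).decode('utf-8').replace('&amp;', '&'))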

The source XML file is "source.xml" and the updated XML file name is "new.xml".

Output (changed part only):

<questiontext>
    &lt;div class=&quot;prefixed-wrapper&quot;&gt;
        &lt;div class=&quot;wrapper element&quot;&gt;
            &lt;span&gt;Exercise 1&lt;/span&gt;
        &lt;/div&gt;
    &lt;/div&gt;
</questiontext>
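
Note that the inner div in this output still carries the class wrapper, because the comparison only matches a class attribute that is exactly "wrapper". A minimal sketch of one way to handle multi-class attributes with ElementTree is to split the attribute on whitespace and rename only the matching token. The helper name rename_class and the sample snippet are made up for illustration, and the sketch assumes each HTML fragment is well-formed XML with a single root element:

```python
import xml.etree.ElementTree as ET

def rename_class(html_fragment, old, new):
    """Rename one class token on every element of a well-formed HTML fragment."""
    # Assumption: the fragment parses as XML and has exactly one root element.
    fragment = ET.fromstring(html_fragment)
    for node in fragment.iter():
        classes = node.get("class")
        if classes is None:
            continue
        # Compare whole tokens so that other classes and their order survive.
        tokens = [new if token == old else token for token in classes.split()]
        node.set("class", " ".join(tokens))
    return ET.tostring(fragment, encoding="unicode", method="html")

snippet = ('<div class="wrapper"><div class="element wrapper">'
           '<span>Exercise 2</span></div></div>')
result = rename_class(snippet, "wrapper", "prefixed-wrapper")
print(result)
```

Because the comparison is made per token, a class such as prefixed-wrapper or wrapper-inner would not be touched by a rename of wrapper.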

CodePudding user response:

Option 2: Your preferred BeautifulSoup solution

from bs4 import BeautifulSoup
#from xml.sax.saxutils import quoteattr, escape, unescape
import re

# Get the XML soup
with open('source.xml', 'r') as f:
    file = f.read() 
soup_xml = BeautifulSoup(file, 'xml')

def soup_htm(elm):
    """Rename the 'wrapper' class in the HTML held by an XML element."""
    # Parse the element's text as HTML
    soup = BeautifulSoup(elm.string, 'html.parser')

    # class_ matches any div whose class list contains the token 'wrapper'
    for elem in soup.find_all('div', class_='wrapper'):
        # 'class' is a list of tokens, so other classes and their order are kept
        elem['class'] = ['prefixed-wrapper' if c == 'wrapper' else c
                         for c in elem['class']]
    return re.sub('"','&quot;', str(soup))

# Find each element and replace its text with the modified HTML
for questiontext in soup_xml.find_all('questiontext'):
    htm_changed = soup_htm(questiontext)
    questiontext.string.replace_with(htm_changed)
  
# Print result
print(soup_xml.prettify())

I prefer the built-in Python modules, but this is also nice and may be easier for such mixed XML/HTML documents. However, the single/double quotes still cause trouble. Maybe another user can help.
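
On the quoting issue, it may help to know that inside XML element text a literal " and the entity &quot; are equivalent, so replacing quotes by hand should not be needed for the file to stay valid. The following is a small sketch using only the standard library (not the BeautifulSoup code above) to illustrate this; the one-line sample documents are made up for the demonstration:

```python
import xml.etree.ElementTree as ET

# The same HTML snippet, once with &quot; and once with literal double quotes.
escaped = ET.fromstring(
    '<questiontext>&lt;div class=&quot;wrapper&quot;&gt;</questiontext>')
literal = ET.fromstring(
    '<questiontext>&lt;div class="wrapper"&gt;</questiontext>')

# Both parse to identical text: the parser resolves the entities on input.
same_text = escaped.text == literal.text
print(same_text)
print(escaped.text)

# On output the serializer escapes &, < and > by itself and leaves the
# quotes as they are, so no manual escaping step is required.
serialized = ET.tostring(escaped, encoding="unicode")
print(serialized)
```

If the consuming application insists on &quot; in the file, that is a stricter requirement than XML itself imposes and would need a dedicated post-processing step.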
