How do I remove a comment outside of the root element of an XML document using python lxml

Time:12-23

How do you remove comments above or below the root node of an XML document using Python's lxml module? I want to remove only the one comment above the root node, NOT all comments in the entire document. For instance, given the following XML document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>

I want to output

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>

The usual way to remove an element would be element.getparent().remove(element), but this doesn't work for nodes outside the root element, since their getparent() returns None. I also tried the suggestions from this Stack Overflow post, but the first answer (using a parser that removes comments) removes all comments from the document, including the ones I want to keep, and the second answer (adding a dummy opening and closing tag around the document) doesn't work if the document has a directive above the root element.

I can get access to the comment above the root element using the following code, but how do I remove it from the document?

from lxml import etree as ET
tree = ET.parse("./sample_file.xml")
root = tree.getroot()
comment = root.getprevious()
# What do I do with comment now??

I've tried doing the following, but none of them worked:

  1. comment.getparent().remove(comment) says AttributeError: 'NoneType' object has no attribute 'remove'
  2. del comment does nothing
  3. comment.clear() does nothing
  4. comment.text = "" renders an empty comment <!---->
  5. root.remove(comment) says ValueError: Element is not a child of this node.
  6. tree.remove(comment) says AttributeError: 'lxml.etree._ElementTree' object has no attribute 'remove'
  7. tree[:] = [root] says TypeError: 'lxml.etree._ElementTree' object does not support item assignment
  8. Initialize a new tree with tree = ET.ElementTree(root). Serializing this new tree still has the comments somehow.
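
For reference, here is a minimal sketch that reproduces the situation without a file on disk. It assumes the sample document is parsed from a byte string (the SAMPLE name is only for illustration) and shows why attempts 1 and 5 fail: the comment is a sibling of the root element, so it has no parent and is not a child of root.

```python
from lxml import etree

# The question's sample document, held in memory (SAMPLE is an illustrative name).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>
"""

root = etree.fromstring(SAMPLE)
comment = root.getprevious()

# The top-level comment is a comment node that sits next to the root element.
is_comment = comment.tag is etree.Comment
# It has no parent element, which is why getparent().remove(...) fails.
has_parent = comment.getparent() is not None
# It is not among root's children, which is why root.remove(...) fails.
is_child_of_root = comment in list(root)

print(is_comment, has_parent, is_child_of_root)
```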

CodePudding user response:

You could build a new tree by serializing only the root element and re-parsing the result with fromstring(), which returns the new root element.

from lxml import etree

tree = etree.parse("sample_file.xml")

new_tree = etree.fromstring(etree.tostring(tree.getroot()))

print(etree.tostring(new_tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())

Printed output (standalone=True writes standalone='yes'; pass standalone=False to keep the original "no"):

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>

Note: This also drops any processing instructions before root. To keep them, another option is to move the comment into root with append() and then remove it from there:

from lxml import etree

tree = etree.parse("sample_file.xml")
root = tree.getroot()

for comment_to_delete in root.xpath("preceding::comment()"):
    root.append(comment_to_delete)
    root.remove(comment_to_delete)

print(etree.tostring(tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())

This produces the same output as above, but will retain any processing instructions that occur before root.
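
The question also mentions comments below the root node. The following is a sketch of how the same append-then-remove trick might be wrapped in a helper that handles both sides. The helper name remove_comments_outside_root, the stylesheet processing instruction, and the trailing comment in SAMPLE are assumptions added for illustration; they are not part of the original question.

```python
from lxml import etree

# Hypothetical sample: the question's document plus a processing
# instruction and a trailing comment, added only for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>
<!-- So does this trailing one -->
"""


def remove_comments_outside_root(tree):
    """Remove comments that are siblings of the root element (hypothetical helper)."""
    root = tree.getroot()
    # Collect the siblings first so the tree is not modified while iterating.
    siblings = list(root.itersiblings(preceding=True)) + list(root.itersiblings())
    for node in siblings:
        if node.tag is etree.Comment:
            # Move the comment into root, then remove it from there.
            root.append(node)
            root.remove(node)
    return tree


tree = etree.fromstring(SAMPLE).getroottree()
remove_comments_outside_root(tree)

# standalone=False writes standalone='no', matching the input declaration.
result = etree.tostring(
    tree, xml_declaration=True, encoding="UTF-8", standalone=False
).decode()
print(result)
```

Processing instructions are skipped by the tag check, so they stay where they are; only comment nodes next to root are moved and removed.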

CodePudding user response:

You can parse an XML file with comments using XMLPullParser from the standard library's xml.etree.ElementTree module:

If your input file looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
    <b>Text</b>
</root>

Parse the file and write it to a new one:

import xml.etree.ElementTree as ET

# Append one line to the intermediate file Cleaned.xml
def write_cleaned_line(text):
    with open('Cleaned.xml', 'a', encoding='utf-8') as my_file:
        my_file.write(text)

with open('Remove_Comment.xml', 'r', encoding='utf-8') as xml:
    feed_lines = xml.readlines()

parser = ET.XMLPullParser(['start', 'end', 'comment'])
for index, line in enumerate(feed_lines):
    # Copy the XML declaration on the first line unchanged
    if index == 0 and line.startswith('<?'):
        write_cleaned_line(line)

    parser.feed(line)

    for event, elem in parser.read_events():
        # Keep every comment except the one on the second line (index 1)
        if event == "comment" and index != 1:
            write_cleaned_line(line)

        # r'\>' is a literal backslash followed by '>', so this check
        # passes for ordinary XML lines
        if event == "start" and r'\>' not in line:
            write_cleaned_line(line)

        if event == "end":
            write_cleaned_line(line)

# Remove duplicate lines (a line can be written once per event);
# this also drops lines that legitimately repeat in the input
xml_list = []
with open('Cleaned.xml', 'rb') as xml:
    for line in xml.readlines():
        if line not in xml_list:
            xml_list.append(line)

with open('Cleaned_final.xml', 'wb') as my_file:
    for line in xml_list:
        my_file.write(line)

# Name of the deduplicated result file
print('Cleaned_final.xml')

Output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <!-- This comment needs to STAY -->
    <a/>
    <b>Text</b>
</root>
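
If working line by line is too fragile for your files, a shorter standard-library sketch may be enough. It assumes Python 3.8 or later, where TreeBuilder accepts insert_comments=True. As far as I can tell, the builder only attaches a comment to an element that is currently open, so comments outside the root never become part of the tree. The SAMPLE name is illustrative, and this sketch does not reproduce the XML declaration or any processing instructions outside the root.

```python
import xml.etree.ElementTree as ET

# The answer's input document, held in memory (SAMPLE is an illustrative name).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
    <b>Text</b>
</root>
"""

# insert_comments=True keeps comments as nodes in the tree. A comment seen
# before the root element opens has no open element to be attached to.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

# Serialize the root element only; no XML declaration is written here.
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```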