How do I remove a comment outside of the root element of an XML document using Python lxml

How do you remove comments above or below the root node of an XML document using Python's lxml module? I want to remove only the one comment above the root node, NOT all comments in the entire document. For instance, given the following XML document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>

I want to output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>

The usual way to remove an element is element.getparent().remove(element), but this doesn't work for nodes at the root level, since getparent() returns None for them. I also tried the suggestions from this Stack Overflow answer. The first answer (using a parser that removes comments) removes all comments from the document, including the ones I want to keep. The second answer (adding a dummy opening and closing tag around the document) doesn't work if the document has a directive above the root element.

I can get access to the comment above the root element using the following code, but how do I remove it from the document?

from lxml import etree as ET
tree = ET.parse("./sample_file.xml")
root = tree.getroot()
comment = root.getprevious()
# What do I do with comment now??

I've tried the following, but none of it worked:

comment.getparent().remove(comment) raises AttributeError: 'NoneType' object has no attribute 'remove'.
del comment does nothing.
comment.clear() does nothing.
comment.text = "" renders an empty comment.
root.remove(comment) raises ValueError: Element is not a child of this node.
tree.remove(comment) raises AttributeError: 'lxml.etree._ElementTree' object has no attribute 'remove'.
tree[:] = [root] raises TypeError: 'lxml.etree._ElementTree' object does not support item assignment.
Initializing a new tree with tree = ET.ElementTree(root) doesn't help either: serializing the new tree still includes the comment.

CodePudding user response：

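You could just build another tree by serializing the root element with tostring() and parsing the result again with fromstring(). Serializing only the root element leaves out everything outside of it.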

from lxml import etree

tree = etree.parse("sample_file.xml")

new_tree = etree.fromstring(etree.tostring(tree.getroot()))

print(etree.tostring(new_tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())

printed output...

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>

Note: This will also remove any processing instructions before root. Another option is to append the comment to root first, so that it has a parent, and then remove it:

from lxml import etree

tree = etree.parse("sample_file.xml")
root = tree.getroot()

for comment_to_delete in root.xpath("preceding::comment()"):
    root.append(comment_to_delete)
    root.remove(comment_to_delete)

print(etree.tostring(tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())

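This produces the same output as above, but retains any processing instructions that occur before root. Note that both snippets pass standalone=True, so the declaration reads standalone='yes'; pass standalone=False to keep the standalone="no" of the original document.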
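The append-then-remove trick can also be wrapped in a small helper that handles comments both above and below the root, as the question asks. This is only a sketch: the function name strip_top_level_comments, the inline sample document, and the <?demo keep-me?> processing instruction are made up for illustration, and it passes standalone=False on the assumption that the original standalone="no" should be kept.

```python
from lxml import etree


def strip_top_level_comments(xml_bytes):
    """Serialize xml_bytes without the comments above or below the root."""
    root = etree.fromstring(xml_bytes)

    # Collect the top-level siblings of root first, then modify the document.
    siblings = []
    node = root.getprevious()
    while node is not None:
        siblings.append(node)
        node = node.getprevious()
    node = root.getnext()
    while node is not None:
        siblings.append(node)
        node = node.getnext()

    for node in siblings:
        # Comments report etree.Comment as their tag; processing
        # instructions do not, so they are left in place.
        if node.tag is etree.Comment:
            # A top-level node has no parent, so adopt it into root first.
            root.append(node)
            root.remove(node)

    return etree.tostring(root.getroottree(), xml_declaration=True,
                          encoding="UTF-8", standalone=False)


# Hypothetical sample: the question's document plus a processing
# instruction and a trailing comment.
sample = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?demo keep-me?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
</root>
<!-- trailing comment, also removed -->
"""

result = strip_top_level_comments(sample)
print(result.decode())
```

Checking node.tag against etree.Comment is what keeps the processing instruction in place; the xpath("preceding::comment()") query above does the same filtering for the nodes before root.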

CodePudding user response：

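You can parse an XML file with comments using XMLPullParser from the standard library's xml.etree.ElementTree. This approach works line by line, so it assumes that every node sits on its own line and that the unwanted comment is on the second line.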

If your input file looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
    <!-- This comment needs to STAY -->
    <a/>
    <b>Text</b>
</root>

Parse the file and write it to a new one:

import xml.etree.ElementTree as ET

# Read the input line by line; this assumes one node per line
with open('Remove_Comment.xml', 'r', encoding='utf-8') as xml_file:
    lines = xml_file.readlines()

# 'comment' events need Python 3.8 or newer
parser = ET.XMLPullParser(['start', 'end', 'comment'])
kept_lines = []
for index, line in enumerate(lines):
    # Keep the XML declaration, which produces no parser event
    if index == 0 and line.startswith('<?'):
        kept_lines.append(line)

    parser.feed(line)
    events = [event for event, elem in parser.read_events()]

    # Skip the comment on the second line (index 1); keep every other
    # line that triggered a start, end, or comment event
    if events and not (index == 1 and set(events) == {'comment'}):
        kept_lines.append(line)

# Each kept line is stored once, so no duplicate clean-up pass is needed
with open('Cleaned_final.xml', 'w', encoding='utf-8') as out_file:
    out_file.writelines(kept_lines)

print('Cleaned_final.xml')

Output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <!-- This comment needs to STAY -->
    <a/>
    <b>Text</b>
</root>