How do you remove comments above or below the root node of an XML document using Python's lxml module? I want to remove only one comment above the root node, NOT all comments in the entire document. For instance, given the following XML document:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
I want to output
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
The usual way to remove an element would be element.getparent().remove(element), but this doesn't work at the root level, since getparent() returns None there. I also tried the suggestions from this Stack Overflow answer, but the first answer (using a parser that removes comments) removes all comments from the document, including the ones I want to keep, and the second answer (adding a dummy opening and closing tag around the document) doesn't work if the document has a directive above the root element.
I can get access to the comment above the root element using the following code, but how do I remove it from the document?
from lxml import etree as ET
tree = ET.parse("./sample_file.xml")
root = tree.getroot()
comment = root.getprevious()
# What do I do with comment now??
I've tried the following, but none of it worked:
- comment.getparent().remove(comment) says AttributeError: 'NoneType' object has no attribute 'remove'
- del comment does nothing
- comment.clear() does nothing
- comment.text = "" renders an empty comment <!---->
- root.remove(comment) says ValueError: Element is not a child of this node.
- tree.remove(comment) says AttributeError: 'lxml.etree._ElementTree' object has no attribute 'remove'
- tree[:] = [root] says TypeError: 'lxml.etree._ElementTree' object does not support item assignment
- Initializing a new tree with tree = ET.ElementTree(root). Serializing this new tree still has the comments somehow.
CodePudding user response:
You could just build another tree by serializing the root element with tostring() and passing the result to fromstring().
from lxml import etree
tree = etree.parse("sample_file.xml")
new_tree = etree.fromstring(etree.tostring(tree.getroot()))
print(etree.tostring(new_tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())
printed output...
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
Note: This will also remove any processing instructions before root, so another option is to append the comment to root before removing it...
from lxml import etree
tree = etree.parse("sample_file.xml")
root = tree.getroot()
for comment_to_delete in root.xpath("preceding::comment()"):
    # append() moves the comment under root, so it now has a parent and can be removed
    root.append(comment_to_delete)
    root.remove(comment_to_delete)
print(etree.tostring(tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())
This produces the same output as above, but will retain any processing instructions that occur before root.
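To make the append-then-remove idea easier to try without a file on disk, here is a minimal sketch on an in-memory document. The helper name remove_comments_before_root, the SAMPLE bytes, and the xml-stylesheet processing instruction are assumptions added for illustration; instead of the XPath expression it walks backwards with getprevious(), which should reach the same top-level nodes.

```python
from lxml import etree

# Hypothetical sample: the question's document plus a processing instruction
# before root, to check that it survives.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
"""


def remove_comments_before_root(root):
    """Remove comment nodes that precede `root` at the top level (illustrative helper)."""
    node = root.getprevious()
    while node is not None:
        previous = node.getprevious()  # grab the next candidate before moving `node`
        if node.tag is etree.Comment:  # skip processing instructions
            root.append(node)  # re-parent the comment under root
            root.remove(node)  # it now has a parent, so remove() works
        node = previous


root = etree.fromstring(SAMPLE)
remove_comments_before_root(root)
result = etree.tostring(
    root.getroottree(), xml_declaration=True, encoding="UTF-8", standalone=False
).decode()
print(result)
```

Serializing root.getroottree() rather than root is what keeps the top-level processing instruction in the output; comments inside root are never touched because the loop only visits root's preceding siblings.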
CodePudding user response:
You can parse an XML file with comments using XMLPullParser.
If your input file looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>
Parse the file and write it to a new one:
import xml.etree.ElementTree as ET

# Append a line to the new file
def write_delte_xml(input):
    with open('Cleaned.xml', 'a') as my_file:
        my_file.write(f'{input}')

with open('Remove_Comment.xml', 'r', encoding='utf-8') as xml:
    feedstring = xml.readlines()

parser = ET.XMLPullParser(['start','end', 'comment'])
for line in enumerate(feedstring):
    # Write the XML declaration line into the new file
    if line[0] == 0 and line[1].startswith('<?'):
        write_delte_xml(line[1])
    parser.feed(line[1])
    for event, elem in parser.read_events():
        # Skip the comment on the second line (index 1), keep all others
        if event == "comment" and line[0] != 1:
            write_delte_xml(line[1])
            #print(line[1])
        if event == "start" and r'\>' not in line[1]:
            write_delte_xml(f"{line[1]}")
            #print("start",f"{line[1]},Element: {elem}")
        if event == "end":
            write_delte_xml(f"{line[1]}")
            #print(f"END: {line[1]}")

# Clean duplicates
xml_list = []
with open('Cleaned.xml', 'rb') as xml:
    lines = xml.readlines()
    for line in lines:
        if line not in xml_list:
            xml_list.append(line)

with open('Cleaned_final.xml', 'wb') as my_file:
    for line in xml_list:
        my_file.write(line)
print('Cleaned.xml')
Output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>
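The line-by-line approach above depends on each tag sitting on its own line and on no two lines being identical. As a possible alternative within the same standard-library module, and not the method used in this answer, here is a sketch that assumes Python 3.8 or newer, where TreeBuilder accepts insert_comments=True. With that option, comments inside the root element are kept in the tree, while a comment that appears before the root has no element to attach to and so should not show up in the result. The SAMPLE string and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The input from this answer, held in memory instead of Remove_Comment.xml.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>
"""

# insert_comments=True keeps comments that occur inside an open element.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

# Serialize only the root element; nothing outside it is part of this tree.
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

Note that this sketch also drops anything else outside the root element, such as processing instructions, and the original standalone="no" declaration is not reproduced, so it only fits documents where that is acceptable.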