I need to remove some parts of an XML file, for example this file:
<dict>
<key>Images</key>
<array>
<dict>
<key>ImageIndex</key>
<integer>0</integer>
<key>NumberOfROIs</key>
<integer>42</integer>
<key>ROIs</key>
<array>
<dict>
<key>Area</key>
<real>0.0</real>
<key>Center</key>
<string>(0.000000, 0.000000, 0.000000)</string>
<key>Dev</key>
<real>0.0</real>
<key>IndexInImage</key>
<integer>0</integer>
<key>Max</key>
<real>1358</real>
<key>Mean</key>
<real>1358</real>
<key>Min</key>
<real>1358</real>
<key>Name</key>
<string>Calcification</string>
<key>NumberOfPoints</key>
<integer>1</integer>
<key>Point_mm</key>
<array>
<string>(0.000000, 0.000000, 0.000000)</string>
</array>
<key>Point_px</key>
<array>
<string>(2964.620117, 3427.979980)</string>
</array>
<key>Total</key>
<real>1358</real>
<key>Type</key>
<integer>19</integer>
</dict>
<dict>
<key>Area</key>
<real>0.0</real>
<key>Center</key>
<string>(0.000000, 0.000000, 0.000000)</string>
<key>Dev</key>
<real>0.0</real>
<key>IndexInImage</key>
<integer>1</integer>
<key>Max</key>
<real>1401</real>
<key>Mean</key>
<real>1401</real>
<key>Min</key>
<real>1401</real>
<key>Name</key>
<string>Calcification</string>
<key>NumberOfPoints</key>
<integer>1</integer>
<key>Point_mm</key>
<array>
<string>(0.000000, 0.000000, 0.000000)</string>
</array>
<key>Point_px</key>
<array>
<string>(2993.159912, 3403.550049)</string>
</array>
<key>Total</key>
<real>1401</real>
<key>Type</key>
<integer>19</integer>
</dict>
<dict>
<key>Area</key>
<real>1.3665732145309448</real>
<key>Center</key>
<string>(0.000000, 0.000000, 0.000000)</string>
<key>Dev</key>
<real>66.487342834472656</real>
<key>IndexInImage</key>
<integer>36</integer>
<key>Max</key>
<real>1836</real>
<key>Mean</key>
<real>1583.29638671875</real>
<key>Min</key>
<real>1313</real>
<key>Name</key>
<string>Mass</string>
<key>NumberOfPoints</key>
<integer>89</integer>
<key>Point_mm</key>
<array>
<string>(0.000000, 0.000000, 0.000000)</string>
<string>(0.000000, 0.000000, 0.000000)</string>
</array>
<key>Point_px</key>
<array>
<string>(3196.290039, 1048.599976)</string>
<string>(3203.560059, 1046.170044)</string>
<string>(3211.330078, 1042.780029)</string>
<string>(3189.500000, 1050.540039)</string>
</array>
<key>Total</key>
<real>44457380</real>
<key>Type</key>
<integer>15</integer>
</dict>
</array>
</dict>
</array>
</dict>
</plist>
I want to remove every <dict> ... </dict> block, tags included, that contains <string>Calcification</string>. In other words, I want to keep only the parts that do not contain Calcification. My desired result for this file would be:
<dict>
<key>Images</key>
<array>
<dict>
<key>ImageIndex</key>
<integer>0</integer>
<key>NumberOfROIs</key>
<integer>42</integer>
<key>ROIs</key>
<array>
<dict>
<key>Area</key>
<real>1.3665732145309448</real>
<key>Center</key>
<string>(0.000000, 0.000000, 0.000000)</string>
<key>Dev</key>
<real>66.487342834472656</real>
<key>IndexInImage</key>
<integer>36</integer>
<key>Max</key>
<real>1836</real>
<key>Mean</key>
<real>1583.29638671875</real>
<key>Min</key>
<real>1313</real>
<key>Name</key>
<string>Mass</string>
<key>NumberOfPoints</key>
<integer>89</integer>
<key>Point_mm</key>
<array>
<string>(0.000000, 0.000000, 0.000000)</string>
<string>(0.000000, 0.000000, 0.000000)</string>
</array>
<key>Point_px</key>
<array>
<string>(3196.290039, 1048.599976)</string>
<string>(3203.560059, 1046.170044)</string>
<string>(3211.330078, 1042.780029)</string>
<string>(3189.500000, 1050.540039)</string>
</array>
<key>Total</key>
<real>44457380</real>
<key>Type</key>
<integer>15</integer>
</dict>
</array>
</dict>
</array>
</dict>
</plist>
This is what I have tried:
data = r"C:\Users\vinc\Desktop\ExemploXML.xml"
import xml.etree.ElementTree as ET
tree = ET.parse(data)
root = tree.getroot()
for e in root.findall(".//string"):
    if e.text == 'Calcification':
        print(e)
        root.remove(e)
    else:
        pass
tree.write(r'C:\Users\vinc\Desktop\out.xml')
Result ======================================
<Element 'string' at 0x000002B085002EA0>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-43-d417d00038ed> in <module>
8
9 print(e)
---> 10 root.remove(e)
11 else:
12 pass
ValueError: list.remove(x): x not in list
For context, these XML files contain semantic segmentation annotations, and I want to remove the annotations of the Calcification class.
CodePudding user response:
Here is an XSLT-based solution.
The XSLT below follows the so-called Identity Transform pattern.
A single one-line template removes the unwanted <dict> elements:
<xsl:template match="dict[string='Calcification']"/>
See also: How to transform an XML file using XSLT in Python?
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="dict[string='Calcification']"/>
</xsl:stylesheet>
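If running an XSLT processor from Python is inconvenient, roughly the same predicate can be tried with the standard library alone, because the XPath subset in xml.etree.ElementTree understands both [tag='text'] and the parent step "..". The following is only a minimal sketch under that assumption, not part of the stylesheet above; the function name drop_dicts_with_string and the inline sample document are made up for illustration, and the text value must not contain a single quote.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the question's file.
SAMPLE = """<dict>
<key>ROIs</key>
<array>
<dict><key>Name</key><string>Calcification</string></dict>
<dict><key>Name</key><string>Mass</string></dict>
</array>
</dict>"""


def drop_dicts_with_string(root, text):
    """Remove every <dict> that has a direct <string> child equal to `text`.

    Returns the number of removed elements.
    """
    predicate = "dict[string='{}']".format(text)
    # Collect the parents of all matching <dict> elements first,
    # so nothing is removed while the tree is still being searched.
    parents = root.findall(".//" + predicate + "/..")
    removed = 0
    for parent in parents:
        # findall() returns a list, so removing from `parent` here is safe.
        for child in parent.findall(predicate):
            parent.remove(child)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = drop_dicts_with_string(root, "Calcification")
names = [s.text for s in root.iter("string")]
print(removed, names)
```

Unlike the stylesheet, this variant never matches the root element itself, because ".//dict" only looks at descendants.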
CodePudding user response:
Listing [Python.Docs]: xml.etree.ElementTree - The ElementTree XML API.
I always prefer searching nodes by XPath, specifying as much of the full path as possible. Of course, the drawback is that if the XML structure changes, the node paths in the code must be adapted accordingly.
Also, as a general pattern (I don't know whether it applies here), never remove elements from a container you are iterating over.
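A small illustration of why that pattern matters, using a made-up four-element document rather than the file from the question: removing children while looping over the parent itself makes the loop skip some of them, while looping over a copy of the child list does not.

```python
import xml.etree.ElementTree as ET

XML = "<array><dict/><dict/><dict/><dict/></array>"

# Unsafe: the parent is modified while it is being iterated,
# so the iteration position drifts and some children are skipped.
unsafe = ET.fromstring(XML)
for child in unsafe:
    unsafe.remove(child)
left_unsafe = len(unsafe)

# Safe: iterate over a snapshot (a list copy) of the children instead.
safe = ET.fromstring(XML)
for child in list(safe):
    safe.remove(child)
left_safe = len(safe)

print("children left (unsafe loop):", left_unsafe)
print("children left (safe loop):", left_safe)
```

Collecting the nodes in a separate list first and removing them afterwards achieves the same effect as the snapshot.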
I saved your source XML in file00.xml (also removing the last, unmatched tag, "</plist>").
code00.py:
#!/usr/bin/env python

import xml.etree.ElementTree as ET
import sys


def main(*argv):
    xml_file_name = "./file00.xml"
    tree = ET.parse(xml_file_name)
    root = tree.getroot()
    inner_array_nodes = root.findall("./array/dict/array")  # XPATH
    # Collect (parent, child) pairs first; removal happens after the iteration.
    to_remove = []
    for parent_node in inner_array_nodes:
        for dict_node in parent_node:
            string_nodes = dict_node.findall("string")
            for string_node in string_nodes:
                if string_node.text == "Calcification":
                    to_remove.append((parent_node, dict_node))
                    break  # one match is enough; avoids queuing the same node twice
    for parent, child in to_remove:
        parent.remove(child)
    print(b"".join(ET.tostringlist(root)).decode())


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q070442605]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 064bit on win32

<dict>
<key>Images</key>
<array>
<dict>
<key>ImageIndex</key>
<integer>0</integer>
<key>NumberOfROIs</key>
<integer>42</integer>
<key>ROIs</key>
<array>
<dict>
<key>Area</key>
<real>1.3665732145309448</real>
<key>Center</key>
<string>(0.000000, 0.000000, 0.000000)</string>
<key>Dev</key>
<real>66.487342834472656</real>
<key>IndexInImage</key>
<integer>36</integer>
<key>Max</key>
<real>1836</real>
<key>Mean</key>
<real>1583.29638671875</real>
<key>Min</key>
<real>1313</real>
<key>Name</key>
<string>Mass</string>
<key>NumberOfPoints</key>
<integer>89</integer>
<key>Point_mm</key>
<array>
<string>(0.000000, 0.000000, 0.000000)</string>
<string>(0.000000, 0.000000, 0.000000)</string>
</array>
<key>Point_px</key>
<array>
<string>(3196.290039, 1048.599976)</string>
<string>(3203.560059, 1046.170044)</string>
<string>(3211.330078, 1042.780029)</string>
<string>(3189.500000, 1050.540039)</string>
</array>
<key>Total</key>
<real>44457380</real>
<key>Type</key>
<integer>15</integer>
</dict>
</array>
</dict>
</array>
</dict>

Done.
CodePudding user response:
Your XML has an extra </plist> tag.
Even if your code ran, it would only remove the <string> element containing the "Calcification" text, not the enclosing <dict> as you intended.
Here is a working solution. It may not be the most optimized code, but it works: I just tried it against your input.
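import xml.etree.ElementTree as ET

tree = ET.parse("sample.xml")
root = tree.getroot()
dict_list = []
array = root.find("./array/dict/array")
# Only direct children can be passed to array.remove(), so scan just those.
for each_dict in array.findall('dict'):
    for each_string in each_dict.iter('string'):
        if each_string.text == "Calcification":
            dict_list.append(each_dict)
            break  # avoid queuing the same dict twice
# Remove after the scan so the container is not modified while iterating.
for each_dict in dict_list:
    array.remove(each_dict)
tree.write('sample3.xml')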
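The ValueError from the question can be reproduced in isolation: Element.remove() only looks at the direct children of the element it is called on, so calling it on the root for a deeply nested node fails. Below is a rough sketch with a made-up inline document; the parent_of dictionary is a common idiom for finding an element's parent in ElementTree, not something taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the question's structure.
root = ET.fromstring(
    "<dict><array>"
    "<dict><string>Calcification</string></dict>"
    "<dict><string>Mass</string></dict>"
    "</array></dict>"
)

target = root.find(".//string")  # first <string>, nested three levels below root

try:
    root.remove(target)  # fails: target is not a direct child of root
    error = None
except ValueError as exc:
    error = exc

# Map every element to its parent so the enclosing <dict> can be located.
parent_of = {child: parent for parent in root.iter() for child in parent}
enclosing_dict = parent_of[target]
# Remove the enclosing <dict> from its own parent, the inner <array>.
parent_of[enclosing_dict].remove(enclosing_dict)

remaining = [s.text for s in root.iter("string")]
print(type(error).__name__, remaining)
```

The parent map does not depend on a fixed path such as "./array/dict/array", at the cost of walking the whole tree once.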