How can I remove a part of XML file?

Time:12-22

I need to remove some parts of an XML file. For example, take this file:

<dict>
    <key>Images</key>
    <array>
        <dict>
            <key>ImageIndex</key>
            <integer>0</integer>
            <key>NumberOfROIs</key>
            <integer>42</integer>
            <key>ROIs</key>
            <array>
                <dict>
                    <key>Area</key>
                    <real>0.0</real>
                    <key>Center</key>
                    <string>(0.000000, 0.000000, 0.000000)</string>
                    <key>Dev</key>
                    <real>0.0</real>
                    <key>IndexInImage</key>
                    <integer>0</integer>
                    <key>Max</key>
                    <real>1358</real>
                    <key>Mean</key>
                    <real>1358</real>
                    <key>Min</key>
                    <real>1358</real>
                    <key>Name</key>
                    <string>Calcification</string>
                    <key>NumberOfPoints</key>
                    <integer>1</integer>
                    <key>Point_mm</key>
                    <array>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                    </array>
                    <key>Point_px</key>
                    <array>
                        <string>(2964.620117, 3427.979980)</string>
                    </array>
                    <key>Total</key>
                    <real>1358</real>
                    <key>Type</key>
                    <integer>19</integer>
                </dict>
                <dict>
                    <key>Area</key>
                    <real>0.0</real>
                    <key>Center</key>
                    <string>(0.000000, 0.000000, 0.000000)</string>
                    <key>Dev</key>
                    <real>0.0</real>
                    <key>IndexInImage</key>
                    <integer>1</integer>
                    <key>Max</key>
                    <real>1401</real>
                    <key>Mean</key>
                    <real>1401</real>
                    <key>Min</key>
                    <real>1401</real>
                    <key>Name</key>
                    <string>Calcification</string>
                    <key>NumberOfPoints</key>
                    <integer>1</integer>
                    <key>Point_mm</key>
                    <array>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                    </array>
                    <key>Point_px</key>
                    <array>
                        <string>(2993.159912, 3403.550049)</string>
                    </array>
                    <key>Total</key>
                    <real>1401</real>
                    <key>Type</key>
                    <integer>19</integer>
                </dict>
                <dict>
                    <key>Area</key>
                    <real>1.3665732145309448</real>
                    <key>Center</key>
                    <string>(0.000000, 0.000000, 0.000000)</string>
                    <key>Dev</key>
                    <real>66.487342834472656</real>
                    <key>IndexInImage</key>
                    <integer>36</integer>
                    <key>Max</key>
                    <real>1836</real>
                    <key>Mean</key>
                    <real>1583.29638671875</real>
                    <key>Min</key>
                    <real>1313</real>
                    <key>Name</key>
                    <string>Mass</string>
                    <key>NumberOfPoints</key>
                    <integer>89</integer>
                    <key>Point_mm</key>
                    <array>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                    </array>
                    <key>Point_px</key>
                    <array>
                        <string>(3196.290039, 1048.599976)</string>
                        <string>(3203.560059, 1046.170044)</string>
                        <string>(3211.330078, 1042.780029)</string>
                        <string>(3189.500000, 1050.540039)</string>
                    </array>
                    <key>Total</key>
                    <real>44457380</real>
                    <key>Type</key>
                    <integer>15</integer>
                </dict>
            </array>
        </dict>
    </array>
</dict>
</plist>  

I want to remove every <dict>...</dict> element (tags included) that contains <string>Calcification</string>. In other words, I want to keep only the parts that do not have Calcification. My desired result for this file would be:

<dict>
    <key>Images</key>
    <array>
        <dict>
            <key>ImageIndex</key>
            <integer>0</integer>
            <key>NumberOfROIs</key>
            <integer>42</integer>
            <key>ROIs</key>
            <array>
                <dict>
                    <key>Area</key>
                    <real>1.3665732145309448</real>
                    <key>Center</key>
                    <string>(0.000000, 0.000000, 0.000000)</string>
                    <key>Dev</key>
                    <real>66.487342834472656</real>
                    <key>IndexInImage</key>
                    <integer>36</integer>
                    <key>Max</key>
                    <real>1836</real>
                    <key>Mean</key>
                    <real>1583.29638671875</real>
                    <key>Min</key>
                    <real>1313</real>
                    <key>Name</key>
                    <string>Mass</string>
                    <key>NumberOfPoints</key>
                    <integer>89</integer>
                    <key>Point_mm</key>
                    <array>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                    </array>
                    <key>Point_px</key>
                    <array>
                        <string>(3196.290039, 1048.599976)</string>
                        <string>(3203.560059, 1046.170044)</string>
                        <string>(3211.330078, 1042.780029)</string>
                        <string>(3189.500000, 1050.540039)</string>
                    </array>
                    <key>Total</key>
                    <real>44457380</real>
                    <key>Type</key>
                    <integer>15</integer>
                </dict>
            </array>
        </dict>
    </array>
</dict>
</plist> 

This is what I have tried:

data = r"C:\Users\vinc\Desktop\ExemploXML.xml"    
    
import xml.etree.ElementTree as ET
tree = ET.parse(data)
root = tree.getroot()
for e in root.findall(".//string"):
    if e.text == 'Calcification':
        
        print(e)
        root.remove(e)
    else:
        pass
tree.write(r'C:\Users\vinc\Desktop\out.xml')

Result ======================================

<Element 'string' at 0x000002B085002EA0>
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-43-d417d00038ed> in <module>
      8 
      9         print(e)
---> 10         root.remove(e)
     11     else:
     12         pass

ValueError: list.remove(x): x not in list

For context, these XML files hold semantic segmentation annotations, and I want to remove the annotations of the Calcification class.

CodePudding user response:

Here is an XSLT-based solution.

The XSLT below follows the so-called Identity Transform pattern.

A single one-line template removes the unwanted <dict> elements:

<xsl:template match="dict[string='Calcification']"/>

To run the stylesheet from Python, see the related question "How to transform an XML file using XSLT in Python?".

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="dict[string='Calcification']"/>
</xsl:stylesheet>
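
The match pattern dict[string='Calcification'] is also understood by the limited XPath support in Python's standard xml.etree.ElementTree, so the same selection can be sketched without an XSLT processor. This is only an illustration, not part of the XSLT answer: the shortened sample document and the helper name remove_matching are made up for the example, and the parent map is there because ElementTree elements do not keep a reference to their parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample with the same shape as the question's file.
SAMPLE = """<dict>
  <key>Images</key>
  <array>
    <dict>
      <key>ROIs</key>
      <array>
        <dict><key>Name</key><string>Calcification</string></dict>
        <dict><key>Name</key><string>Mass</string></dict>
        <dict><key>Name</key><string>Calcification</string></dict>
      </array>
    </dict>
  </array>
</dict>"""


def remove_matching(root, path):
    """Remove every element selected by `path`; return how many were removed."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # findall() returns a list, so removing while looping over it is safe.
    matches = root.findall(path)
    for node in matches:
        parent_of[node].remove(node)
    return len(matches)


root = ET.fromstring(SAMPLE)
# Same predicate as the XSLT template: a <dict> with a <string> child
# whose text is exactly 'Calcification'.
removed = remove_matching(root, ".//dict[string='Calcification']")
remaining = [s.text for s in root.iter("string")]
print(removed, remaining)
```

On the sample above this removes the two Calcification entries and leaves only the Mass one. Unlike a hard-coded path, the .// prefix finds matching <dict> elements at any depth.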

CodePudding user response:

Listing [Python.Docs]: xml.etree.ElementTree - The ElementTree XML API.

I always prefer searching for nodes by XPath, specifying as much of the full path as possible. Of course, the drawback is that if the XML structure changes, the node paths in the code have to be adapted accordingly.

Also, as a general pattern (I don't know whether it applies here), never remove elements from a container you're iterating over.
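
A small sketch of that pitfall, using an invented three-element document: removing children from an element while looping over that same element skips siblings, whereas looping over a snapshot (a list copy) does not.

```python
import xml.etree.ElementTree as ET

XML = (
    "<array>"
    "<string>Calcification</string>"
    "<string>Calcification</string>"
    "<string>Mass</string>"
    "</array>"
)

# Buggy: mutate the element while iterating over it.
buggy = ET.fromstring(XML)
for child in buggy:
    if child.text == "Calcification":
        buggy.remove(child)  # shifts the remaining children to the left
buggy_left = [c.text for c in buggy]

# Safe: iterate over a snapshot and remove from the original element.
safe = ET.fromstring(XML)
for child in list(safe):
    if child.text == "Calcification":
        safe.remove(child)
safe_left = [c.text for c in safe]

print(buggy_left)
print(safe_left)
```

The first loop leaves one Calcification entry behind because the second child slides into the slot the iteration has already passed; the second loop removes both.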

I saved your source XML in file00.xml (after removing the last, unmatched tag, "</plist>").

code00.py:

#!/usr/bin/env python

import xml.etree.ElementTree as ET
import sys


def main(*argv):
    xml_file_name = "./file00.xml"
    tree = ET.parse(xml_file_name)
    root = tree.getroot()
    # Limited XPath: the ROI <array> nested inside each image <dict>
    inner_array_nodes = root.findall("./array/dict/array")
    to_remove = []
    for parent_node in inner_array_nodes:
        for dict_node in parent_node:
            string_nodes = dict_node.findall("string")
            for string_node in string_nodes:
                if string_node.text == "Calcification":
                    to_remove.append((parent_node, dict_node))
                    break  # One match is enough; avoids queuing the same node twice

    for parent, child in to_remove:
        parent.remove(child)

    print(b"".join(ET.tostringlist(root)).decode())


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q070442605]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 064bit on win32

<dict>
    <key>Images</key>
    <array>
        <dict>
            <key>ImageIndex</key>
            <integer>0</integer>
            <key>NumberOfROIs</key>
            <integer>42</integer>
            <key>ROIs</key>
            <array>
                <dict>
                    <key>Area</key>
                    <real>1.3665732145309448</real>
                    <key>Center</key>
                    <string>(0.000000, 0.000000, 0.000000)</string>
                    <key>Dev</key>
                    <real>66.487342834472656</real>
                    <key>IndexInImage</key>
                    <integer>36</integer>
                    <key>Max</key>
                    <real>1836</real>
                    <key>Mean</key>
                    <real>1583.29638671875</real>
                    <key>Min</key>
                    <real>1313</real>
                    <key>Name</key>
                    <string>Mass</string>
                    <key>NumberOfPoints</key>
                    <integer>89</integer>
                    <key>Point_mm</key>
                    <array>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                        <string>(0.000000, 0.000000, 0.000000)</string>
                    </array>
                    <key>Point_px</key>
                    <array>
                        <string>(3196.290039, 1048.599976)</string>
                        <string>(3203.560059, 1046.170044)</string>
                        <string>(3211.330078, 1042.780029)</string>
                        <string>(3189.500000, 1050.540039)</string>
                    </array>
                    <key>Total</key>
                    <real>44457380</real>
                    <key>Type</key>
                    <integer>15</integer>
                </dict>
            </array>
        </dict>
    </array>
</dict>

Done.

CodePudding user response:

  1. Your XML has an extra </plist> tag.

  2. Even if your code did work, it only tries to remove the <string> element that contains the "Calcification" text, not the enclosing <dict> that you want removed.

  3. Here is a working solution. It may not be the most optimized code, but it works; I just tried it against your input.

import xml.etree.ElementTree as ET

tree = ET.parse("sample.xml")
root = tree.getroot()
dict_list = []

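# The inner <array> that holds the ROI <dict> elements.
# Note: find() returns only the first match, so only the first image is processed.
array = root.find("./array/dict/array")

# Collect first, remove afterwards, so the array is not changed while looping over it
for each_dict in array.findall('dict'):  # direct children only
    for each_string in each_dict.iter('string'):
        if each_string.text == "Calcification":
            dict_list.append(each_dict)
            break  # avoid adding the same <dict> twice

for each_dict in dict_list:
    array.remove(each_dict)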

tree.write('sample3.xml')
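
Point 2 and the ValueError from the question can be reproduced on a tiny, made-up document. Element.remove() only looks at the direct children of the element it is called on, so calling it on the root with a deeply nested node fails, while calling it on the actual parent works. A minimal sketch:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the question's structure.
root = ET.fromstring(
    "<dict><array>"
    "<dict><string>Calcification</string></dict>"
    "<dict><string>Mass</string></dict>"
    "</array></dict>"
)

array = root.find("array")
target = array.find("dict")  # first ROI <dict>, a grandchild of root

try:
    root.remove(target)  # target is not a direct child of root
    error = None
except ValueError as exc:
    error = str(exc)

array.remove(target)  # works: array is the direct parent
names = [s.text for s in root.iter("string")]
print(error is not None, names)
```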