XML editing Python

I'm looking for some help cleaning up XML files in Python. Below is just a small snippet from a file of 50 thousand lines. I have many XML files containing the same sort of data.

xml = """<?xml version="1.0" encoding="utf-8"?>
<file>
  <SORT_INFO>
    <sort_type>sort order</sort_type>
  </SORT_INFO>
  <ALL_INSTANCES>
    <instance>
      <ID>1</ID>
      <start>0</start>
      <end>17.96</end>
      <code>14. Jordan Brian Henderson</code>
      <label>
        <group>Team</group>
        <text>Liverpool FC</text>
      </label>
      <label>
        <group>Action</group>
        <text>Passes accurate</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>52.4</pos_x>
      <pos_y>34.0</pos_y>
    </instance>
    <instance>
      <ID>7</ID>
      <start>7.96</start>
      <end>8.96</end>
      <code>Start</code>
    </instance>
    <instance>
      <ID>8</ID>
      <start>10.28</start>
      <end>30.28</end>
      <code>26. Andrew Robertson</code>
      <label>
        <group>Team</group>
        <text>Liverpool FC</text>
      </label>
      <label>
        <group>Action</group>
        <text>Passes accurate</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>61.7</pos_x>
      <pos_y>68.0</pos_y>
    </instance>
    <instance>
      <ID>1321</ID>
      <start>3770.67</start>
      <end>3790.67</end>
      <code>3. Fabinho</code>
      <label>
        <group>Team</group>
        <text>Liverpool FC</text>
      </label>
      <label>
        <group>Action</group>
        <text>Passes accurate</text>
      </label>
      <label>
        <group>Half</group>
        <text>2nd half</text>
      </label>
      <pos_x>62.7</pos_x>
      <pos_y>3.7</pos_y>
    </instance>
    <instance>
      <ID>1882</ID>
      <start>5695.17</start>
      <end>5715.17</end>
      <code>2. Fabio Cardoso</code>
      <label>
        <group>Team</group>
        <text>Porto</text>
      </label>
      <label>
        <group>Action</group>
        <text>Interceptions</text>
      </label>
      <label>
        <group>Half</group>
        <text>2nd half</text>
      </label>
      <pos_x>8.1</pos_x>
      <pos_y>46.3</pos_y>
    </instance>
  </ALL_INSTANCES>
  <ROWS>
    <row>
      <code>20. Vitinha</code>
      <sort_order>15</sort_order>
      <R>51400</R>
      <G>51400</G>
      <B>51400</B>
    </row>
    <row>
      <code>11. Pepe</code>
      <sort_order>16</sort_order>
      <R>51400</R>
      <G>51400</G>
      <B>51400</B>
    </row>
  </ROWS>
</file>
"""

I'd like to remove everything before <ALL_INSTANCES> and everything after </ALL_INSTANCES>.

I'd also like to remove any <instance> element that contains <code>Start</code>.

Would it be possible to do this for all the XML files in a folder?

Thanks

CodePudding user response:

See below: grab the nodes you need, create a new XML document, and dump it.

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<file>
   <SORT_INFO>
      <sort_type>sort order</sort_type>
   </SORT_INFO>
   <ALL_INSTANCES>
      <instance>
         <ID>1</ID>
         <start>0</start>
         <end>17.96</end>
         <code>14. Jordan Brian Henderson</code>
         <label>
            <group>Team</group>
            <text>Liverpool FC</text>
         </label>
         <label>
            <group>Action</group>
            <text>Passes accurate</text>
         </label>
         <label>
            <group>Half</group>
            <text>1st half</text>
         </label>
         <pos_x>52.4</pos_x>
         <pos_y>34.0</pos_y>
      </instance>
      <instance>
         <ID>7</ID>
         <start>7.96</start>
         <end>8.96</end>
         <code>Start</code>
      </instance>
      <instance>
         <ID>8</ID>
         <start>10.28</start>
         <end>30.28</end>
         <code>26. Andrew Robertson</code>
         <label>
            <group>Team</group>
            <text>Liverpool FC</text>
         </label>
         <label>
            <group>Action</group>
            <text>Passes accurate</text>
         </label>
         <label>
            <group>Half</group>
            <text>1st half</text>
         </label>
         <pos_x>61.7</pos_x>
         <pos_y>68.0</pos_y>
      </instance>
      <instance>
         <ID>1321</ID>
         <start>3770.67</start>
         <end>3790.67</end>
         <code>3. Fabinho</code>
         <label>
            <group>Team</group>
            <text>Liverpool FC</text>
         </label>
         <label>
            <group>Action</group>
            <text>Passes accurate</text>
         </label>
         <label>
            <group>Half</group>
            <text>2nd half</text>
         </label>
         <pos_x>62.7</pos_x>
         <pos_y>3.7</pos_y>
      </instance>
      <instance>
         <ID>1882</ID>
         <start>5695.17</start>
         <end>5715.17</end>
         <code>2. Fabio Cardoso</code>
         <label>
            <group>Team</group>
            <text>Porto</text>
         </label>
         <label>
            <group>Action</group>
            <text>Interceptions</text>
         </label>
         <label>
            <group>Half</group>
            <text>2nd half</text>
         </label>
         <pos_x>8.1</pos_x>
         <pos_y>46.3</pos_y>
      </instance>
   </ALL_INSTANCES>
   <ROWS>
      <row>
         <code>20. Vitinha</code>
         <sort_order>15</sort_order>
         <R>51400</R>
         <G>51400</G>
         <B>51400</B>
      </row>
      <row>
         <code>11. Pepe</code>
         <sort_order>16</sort_order>
         <R>51400</R>
         <G>51400</G>
         <B>51400</B>
      </row>
   </ROWS>
</file>'''

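root = ET.fromstring(xml)
instances = root.find('ALL_INSTANCES')
# findall() returns a list, so removing elements inside the loop is safe
for instance in instances.findall('instance'):
    # findtext() returns None instead of raising if <code> is missing
    if instance.findtext('code') == 'Start':
        instances.remove(instance)
# build a new root that holds only the filtered <ALL_INSTANCES> node
file = ET.Element('file')
file.append(instances)
ET.dump(file)  # writes the new document to stdout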

Output (re-indented here for readability; ET.dump does not print an XML declaration)

<file>
   <ALL_INSTANCES>
      <instance>
         <ID>1</ID>
         <start>0</start>
         <end>17.96</end>
         <code>14. Jordan Brian Henderson</code>
         <label>
            <group>Team</group>
            <text>Liverpool FC</text>
         </label>
         <label>
            <group>Action</group>
            <text>Passes accurate</text>
         </label>
         <label>
            <group>Half</group>
            <text>1st half</text>
         </label>
         <pos_x>52.4</pos_x>
         <pos_y>34.0</pos_y>
      </instance>
      <instance>
         <ID>8</ID>
         <start>10.28</start>
         <end>30.28</end>
         <code>26. Andrew Robertson</code>
         <label>
            <group>Team</group>
            <text>Liverpool FC</text>
         </label>
         <label>
            <group>Action</group>
            <text>Passes accurate</text>
         </label>
         <label>
            <group>Half</group>
            <text>1st half</text>
         </label>
         <pos_x>61.7</pos_x>
         <pos_y>68.0</pos_y>
      </instance>
      <instance>
         <ID>1321</ID>
         <start>3770.67</start>
         <end>3790.67</end>
         <code>3. Fabinho</code>
         <label>
            <group>Team</group>
            <text>Liverpool FC</text>
         </label>
         <label>
            <group>Action</group>
            <text>Passes accurate</text>
         </label>
         <label>
            <group>Half</group>
            <text>2nd half</text>
         </label>
         <pos_x>62.7</pos_x>
         <pos_y>3.7</pos_y>
      </instance>
      <instance>
         <ID>1882</ID>
         <start>5695.17</start>
         <end>5715.17</end>
         <code>2. Fabio Cardoso</code>
         <label>
            <group>Team</group>
            <text>Porto</text>
         </label>
         <label>
            <group>Action</group>
            <text>Interceptions</text>
         </label>
         <label>
            <group>Half</group>
            <text>2nd half</text>
         </label>
         <pos_x>8.1</pos_x>
         <pos_y>46.3</pos_y>
      </instance>
   </ALL_INSTANCES>
</file>
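
The question also asks about doing this for every XML file in a folder, which the snippet above does not cover. A minimal sketch of one way to do it follows, assuming all files share the layout shown in the question. The function names clean_file and clean_folder and the separate output folder are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def clean_file(src: Path, dst: Path) -> None:
    """Keep only <ALL_INSTANCES>, minus instances whose <code> is 'Start'."""
    root = ET.parse(src).getroot()
    instances = root.find("ALL_INSTANCES")
    if instances is None:
        return  # assumption: silently skip files without the expected section
    # findall() returns a list, so removing while looping is safe
    for instance in instances.findall("instance"):
        if instance.findtext("code") == "Start":
            instances.remove(instance)
    # new root with the same tag as the original, holding only the kept node
    new_root = ET.Element(root.tag)
    new_root.append(instances)
    ET.ElementTree(new_root).write(dst, encoding="utf-8", xml_declaration=True)


def clean_folder(in_dir: str, out_dir: str) -> None:
    """Run clean_file on every *.xml file in in_dir, writing into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.xml")):
        clean_file(src, out / src.name)
```

Calling clean_folder("raw", "cleaned") would process every .xml file in a hypothetical raw folder. Writing to a separate folder leaves the originals untouched, which makes it easy to rerun if the filter needs adjusting.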

CodePudding user response:

Consider XSLT, the special-purpose language designed to transform XML files. Python's lxml can run XSLT 1.0 scripts. Alternatively, you can have Python run external XSLT processors.

Specifically, the XSLT below uses the identity transform to copy the document as is, then overrides the template for <file> (i.e., the root) so that it returns only its <ALL_INSTANCES> child. A second, empty template matches the conditional XPath expression instance[code='Start'] and so removes those nodes from the tree. No loops are required with this approach!

XSLT (save as .xsl file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
  <xsl:strip-space elements = "*"/>
  
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy> 
  </xsl:template>
  
  <xsl:template match="file">
      <xsl:copy>
          <xsl:apply-templates select="ALL_INSTANCES"/>
      </xsl:copy>  
  </xsl:template>
  
  <xsl:template match="instance[code='Start']"/>

</xsl:stylesheet>

Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
result.write_output("Output.xml")
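
If lxml is not available, the predicate used by the empty template, instance[code='Start'], is also understood by the limited XPath support in the standard library's xml.etree.ElementTree. The following is a minimal sketch on a trimmed-down sample. The sample string and variable names are illustrative only, and unlike the XSLT it still needs a short loop to remove the matches.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical sample with the same structure as the question's data
sample = """<file>
  <ALL_INSTANCES>
    <instance><ID>7</ID><code>Start</code></instance>
    <instance><ID>8</ID><code>26. Andrew Robertson</code></instance>
  </ALL_INSTANCES>
</file>"""

root = ET.fromstring(sample)
instances = root.find("ALL_INSTANCES")

# Same predicate as the empty XSLT template: instance[code='Start']
for unwanted in instances.findall("instance[code='Start']"):
    instances.remove(unwanted)

remaining_ids = [inst.findtext("ID") for inst in instances.findall("instance")]
print(remaining_ids)  # ['8']
```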