Is there a Python XML parser library that allows removing nodes without modifying the XML file layout?

I have an XML file that looks like this:

<?xml version='1.0'?>
<datamodel version="7.0" 
           xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd" 
           xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd" 
           xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd" 
           xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd">

  <d:ctr type="AUTOSAR" factory="autosar" 
         xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd" 
         xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd" 
         xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd" 
         xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd" 
         xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"  
         xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd">
    <d:lst type="TOP-LEVEL-PACKAGES">
      <d:ctr name="Lin" type="AR-PACKAGE">
        <d:lst type="ELEMENTS">
          <d:chc name="Lin" type="AR-ELEMENT" value="MODULE-CONFIGURATION">
            <d:ctr type="MODULE-CONFIGURATION">
              <a:a name="DEF" value="ASPath:/TS_T16D27M10I10R0/EcucDefs/Lin"/>
              <a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
              <d:var name="IMPLEMENTATION_CONFIG_VARIANT" type="ENUMERATION" 
                     value="VariantPostBuild">
                <a:a name="IMPORTER_INFO" value="@DEF"/>
              </d:var>
              <d:lst name="LinDemEventParameterRefs" type="MAP"/>
              <d:ctr name="LinGeneral" type="IDENTIFIABLE">
                <d:var name="LinDevErrorDetect" type="BOOLEAN" value="true"/>
                <d:var name="LinMultiCoreErrorDetect" type="BOOLEAN" 
                       value="true"/>
                <d:var name="LinIndex" type="INTEGER" value="0">
                  <a:a name="IMPORTER_INFO" value="@DEF"/>
                </d:var>
                <d:var name="LinTimeoutDuration" type="INTEGER" value="300">
                  <a:a name="IMPORTER_INFO" value="@DEF"/>
                </d:var>
              </d:ctr>
            </d:ctr>
          </d:chc>
        </d:lst>
      </d:ctr>
    </d:lst>
  </d:ctr>
</datamodel>

I would like to find a simple parser in Python that lets me parse this file and remove some nodes without modifying any indentation, whitespace, etc. I have tried several solutions:

xml.etree.ElementTree: messes up the namespaces
lxml.etree: fine with the namespaces, but it reorganises the attributes, line length and indentation
a simple sed script: does not cope well with elements that have children, that have an explicit closing tag, etc.
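
The ElementTree namespace problem can be reproduced with a minimal sketch like the one below. The tiny document and the urn:example:a namespace are made up for illustration; the point is only that the original prefix is not kept on output.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-element document using a prefixed namespace.
source = '<root xmlns:a="urn:example:a"><a:item name="x"/></root>'

root = ET.fromstring(source)
output = ET.tostring(root, encoding="unicode")

# ElementTree stores tags as {uri}local and, when serialising, invents
# prefixes (ns0, ns1, ...) for namespaces that were not registered.
print(output)
```

ET.register_namespace can keep the prefix names, but the serialiser still decides where the declarations go and how the attributes are spaced.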

Ideally, I would just need a very simple stream parser that detects the start and end of each node and lets me remove the ones I no longer want.

CodePudding user response：

Phew, that was not as easy as I expected. I ended up using xml.parsers.expat so that I could access the original, unparsed input through the parser's .GetInputContext() method.

Then, by catching the various events and implementing two states, forward and discard, I could copy the input strings exactly as they appear in the original source file.
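
To illustrate what GetInputContext() provides, here is a minimal sketch on a made-up two-element document (the raw bytes and the on_start handler name are only for illustration). It assumes the whole input is passed in a single Parse call, as in the code further down.

```python
import xml.parsers.expat

# Hypothetical input with unusual spacing and a line break between attributes.
raw = b'<root  a="1"\n       b="2"><child/></root>'
seen = []

parser = xml.parsers.expat.ParserCreate()

def on_start(name, attrs):
    # GetInputContext() returns bytes of the original input, beginning at
    # the construct that triggered the current event.
    context = parser.GetInputContext()
    seen.append((name, context))

parser.StartElementHandler = on_start
parser.Parse(raw, True)

for name, context in seen:
    print(name, context[:30])
```

Because the context is the untouched source text, the spacing and line breaks between attributes are still there, which is what makes a layout-preserving copy possible.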

I have added some goodies to:

support using short form of nodes when the content is entirely removed
remove leading indentation of removed node

Known issue:

If an attribute value of an element contains ">" or "/>", this will fail (see the TODO in the code below).
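
One possible way around this known issue is to scan for the end of the start tag while skipping quoted attribute values. The helper below is a hypothetical sketch (find_start_tag_end is not part of the code that follows) and assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1.

```python
def find_start_tag_end(ctx):
    """Locate the end of the start tag that begins at ctx[0].

    Returns (index, self_closing): index is the offset of the closing '>'
    or, for a short-form tag, of the '/' in '/>'. Anything inside a quoted
    attribute value is skipped, so '>' and '/>' in values are ignored.
    """
    quote = None  # quote character currently open, if any
    for i in range(len(ctx)):
        ch = ctx[i:i + 1]
        if quote is not None:
            if ch == quote:
                quote = None
        elif ch in (b'"', b"'"):
            quote = ch
        elif ch == b">":
            self_closing = i > 0 and ctx[i - 1:i] == b"/"
            return (i - 1 if self_closing else i), self_closing
    raise ValueError("no end of start tag found in context")


ctx = b'<a:a name="x > y" value="1/>"/>\n<b>'
index, self_closing = find_start_tag_end(ctx)
print(ctx[:index], self_closing)
```

In start_element, ctx[:index] could then replace the slice computed from brack1 and brack2.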

import io
import xml.parsers.expat

class xmlFilter:
    def __init__(self, fout, parser, encoding="iso-8859-1"):
        self.fout = fout
        self.parser = parser

        self._pending_start_element = False
        self._depth_count = 0
        self._pending_character_data = b""

        self.switch_out_discard()

    def switch_out_discard(self):
        self.parser.CommentHandler = self.comment
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.character_data

    def switch_in_discard(self):
        self.parser.CommentHandler = self.comment_discard
        self.parser.StartElementHandler = self.start_element_discard
        self.parser.EndElementHandler = self.end_element_discard
        self.parser.CharacterDataHandler = self.character_data_discard

    def handle_pending_start_at_start(self):
        if self._pending_start_element:
            self.fout.write(b">")
            self._pending_start_element = False
        self.fout.write(self._pending_character_data)
        self._pending_character_data = b""

    def comment_discard(self, data):
        pass

    def start_element_discard(self, name, attrs):
        self._depth_count += 1

    def end_element_discard(self, name):
        self._depth_count -= 1
        if self._depth_count == 0:
            self.switch_out_discard()

    def character_data_discard(self, data):
        pass

    def comment(self, data):
        self.handle_pending_start_at_start()
        # expat's comment data already includes any padding spaces
        self.fout.write(b"<!--" + bytes(data, "UTF-8") + b"-->")

    def start_element(self, name, attrs):
        if "name" in attrs and attrs["name"] == "IMPORTER_INFO":
            self._depth_count = 1
            self.switch_in_discard()

            # we are explicitly not handling the pending start to be able to
            # create a short form node if there is no data

            # remove the trailing spaces to remove the indentation
            self._pending_character_data = self._pending_character_data.rstrip()
        else:
            self.handle_pending_start_at_start()

            ctx = self.parser.GetInputContext()
            # look for the next xml closing element
            # TODO: support the > and /> that are inside attributes
            brack1 = ctx.index(b">")
            # it is possible that there is no short close
            try:
                brack2 = ctx.index(b"/>")
            except ValueError:
                brack2 = 99999999
            if brack1 < brack2:
                data = ctx[:brack1]
            else:
                data = ctx[:brack2]
            self._pending_start_element = True
            self.fout.write(data)

    def end_element(self, name):
        if self._pending_start_element:
            if len(self._pending_character_data.rstrip()) == 0:
                self.fout.write(b"/>")
            else:
                self.fout.write(b">")
                self.fout.write(self._pending_character_data)
                self.fout.write(b"</" + bytes(name, "UTF-8") + b">")
            self._pending_start_element = False
        else:
            self.fout.write(self._pending_character_data)
            self.fout.write(b"</" + bytes(name, "UTF-8") + b">")
        self._pending_character_data = b""

    def character_data(self, data):
        self._pending_character_data += bytes(data, "UTF-8")


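# Paste the XML input here (elided in this post). The filter installs no
# handler for the XML declaration, so it is written to the output by hand below.
data = b'''
'''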
fout = io.BytesIO()
fout.write(b"<?xml version='1.0'?>\n")
parser = xml.parsers.expat.ParserCreate()
filter = xmlFilter(fout, parser)
parser.Parse(data)
print(fout.getvalue().decode("utf-8"))

Input is:

<?xml version='1.0'?>
<datamodel version="7.0" 
           xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd" 
           xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd" 
           xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd" 
           xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd">

  <d:ctr type="AUTOSAR" factory="autosar" 
         xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd" 
         xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd" 
         xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd" 
         xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd" 
         xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"  
         xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd">
    <d:lst type="TOP-LEVEL-PACKAGES">
      <d:ctr name="Lin" type="AR-PACKAGE">
        <d:lst type="ELEMENTS">
          <d:chc name="Lin" type="AR-ELEMENT" value="MODULE-CONFIGURATION">
            <d:ctr type="MODULE-CONFIGURATION">
              <a:a name="DEF" value="ASPath:/TS_T16D27M10I10R0/EcucDefs/Lin"/>
              <a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
              <d:var name="IMPLEMENTATION_CONFIG_VARIANT" type="ENUMERATION" 
                     value="VariantPostBuild">
                <a:a name="IMPORTER_INFO" value="@DEF"/>
              </d:var>
              <d:lst name="LinDemEventParameterRefs" type="MAP"/>
              <d:ctr name="LinGeneral" type="IDENTIFIABLE">
                <d:var name="LinDevErrorDetect" type="BOOLEAN" value="true"/>
                <d:var name="LinMultiCoreErrorDetect" type="BOOLEAN" 
                       value="true"/>
                <d:var name="LinIndex" type="INTEGER" value="0">
                  <a:a name="IMPORTER_INFO" value="@DEF"/>
                </d:var>
                <d:var name="LinTimeoutDuration" type="INTEGER" value="300">
                  <a:a name="IMPORTER_INFO" value="@DEF"/>
                </d:var>
                <d:var name="LinVersionInfoApi" type="BOOLEAN" value="false"/>
                <d:var name="LinHwMcuTrigSleepEnable" type="BOOLEAN" 
                       value="false"/>
                <d:ref name="LinSysClockRef" type="REFERENCE" 
                       value="ASPath:/Mcu/Mcu/McuModuleConfiguration/McuClockSettingConfig_0/McuClockReferencePointConfig"/>
                <d:var name="LinCsrClksel" type="ENUMERATION" value="ASCLINF"/>
                <d:var name="LinInitApiMode" type="ENUMERATION" 
                       value="LIN_MCAL_SUPERVISOR">
                  <a:a name="IMPORTER_INFO" value="@DEF"/>
                </d:var>
                <d:var name="LinMasterInterruptEnable" type="BOOLEAN" 
                       value="true">
                  <a:a name="IMPORTER_INFO" value="@DEF"/>
                </d:var>
              </d:ctr>
              <d:ctr name="LinGlobalConfig" type="IDENTIFIABLE">
                <a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
                <d:lst name="LinChannel" type="MAP">
                  <a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
                  <d:ctr name="LIN_A1" type="IDENTIFIABLE">
                  </d:ctr>
                </d:lst>
              </d:ctr>
            </d:ctr>
          </d:chc>
        </d:lst>
      </d:ctr>
    </d:lst>
  </d:ctr>
</datamodel>

Result is:

<?xml version='1.0'?>
<datamodel version="7.0"
           xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd"
           xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd"
           xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd"
           xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd">

  <d:ctr type="AUTOSAR" factory="autosar"
         xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd"
         xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd"
         xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd"
         xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd"
         xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"
         xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd">
    <d:lst type="TOP-LEVEL-PACKAGES">
      <d:ctr name="Lin" type="AR-PACKAGE">
        <d:lst type="ELEMENTS">
          <d:chc name="Lin" type="AR-ELEMENT" value="MODULE-CONFIGURATION">
            <d:ctr type="MODULE-CONFIGURATION">
              <a:a name="DEF" value="ASPath:/TS_T16D27M10I10R0/EcucDefs/Lin"/>
              <d:var name="IMPLEMENTATION_CONFIG_VARIANT" type="ENUMERATION"
                     value="VariantPostBuild"/>
              <d:lst name="LinDemEventParameterRefs" type="MAP"/>
              <d:ctr name="LinGeneral" type="IDENTIFIABLE">
                <d:var name="LinDevErrorDetect" type="BOOLEAN" value="true"/>
                <d:var name="LinMultiCoreErrorDetect" type="BOOLEAN"
                       value="true"/>
                <d:var name="LinIndex" type="INTEGER" value="0"/>
                <d:var name="LinTimeoutDuration" type="INTEGER" value="300"/>
                <d:var name="LinVersionInfoApi" type="BOOLEAN" value="false"/>
                <d:var name="LinHwMcuTrigSleepEnable" type="BOOLEAN"
                       value="false"/>
                <d:ref name="LinSysClockRef" type="REFERENCE"
                       value="ASPath:/Mcu/Mcu/McuModuleConfiguration/McuClockSettingConfig_0/McuClockReferencePointConfig"/>
                <d:var name="LinCsrClksel" type="ENUMERATION" value="ASCLINF"/>
                <d:var name="LinInitApiMode" type="ENUMERATION"
                       value="LIN_MCAL_SUPERVISOR"/>
                <d:var name="LinMasterInterruptEnable" type="BOOLEAN"
                       value="true"/>
              </d:ctr>
              <d:ctr name="LinGlobalConfig" type="IDENTIFIABLE">
                <d:lst name="LinChannel" type="MAP">
                  <d:ctr name="LIN_A1" type="IDENTIFIABLE"/>
                </d:lst>
              </d:ctr>
            </d:ctr>
          </d:chc>
        </d:lst>
      </d:ctr>
    </d:lst>
  </d:ctr>
</datamodel>

CodePudding user response：

Consider XSLT, the special-purpose language designed to transform XML files. Python can run XSLT 1.0 scripts with lxml, but XSLT is not tied to Python: it also runs in standalone processors and from other programming languages.

In particular, the well-known identity transform template copies the XML as is, and an empty template then removes the unwanted nodes. The script below removes all <a:a> nodes. As noted in the comments above and shown in the demo below, processors may change spacing and indentation (such as indented attributes or attribute order), but the content should remain intact.

XSLT (save as .xsl, a special .xml file)

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd" 
                xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd" 
                xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd" 
                xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd"
                xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd" 
                xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd" 
                xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd" 
                xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd" 
                xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"  
                xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd"
    version="1.0">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  
  <!-- EMPTY TEMPLATE TO REMOVE NODES -->
  <xsl:template match="a:a"/>
  
</xsl:stylesheet>

Online Demo

Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("input.xml")
style = lx.parse("style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# PRINT TO SCREEN
print(result)

# OUTPUT TO FILE
result.write_output("output.xml")
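
To check how much of the layout a given processor actually changed, a line-by-line diff of the input and output texts can help. The sketch below uses only the standard library; layout_diff and the two short sample strings are hypothetical and stand in for the contents of input.xml and output.xml.

```python
import difflib

def layout_diff(before, after):
    """Return unified-diff lines between two XML texts, compared line by line."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="input.xml", tofile="output.xml", lineterm=""))

# Made-up samples: same content, but the attributes are re-wrapped.
before = '<r>\n  <d:var name="x"\n         value="1"/>\n</r>'
after = '<r>\n  <d:var name="x" value="1"/>\n</r>'

for line in layout_diff(before, after):
    print(line)
```

An empty diff means the layout was preserved exactly; any "-" or "+" lines beyond the removed nodes point at formatting the processor rewrote.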