I have an XML file that looks like this:
<?xml version='1.0'?>
<datamodel version="7.0"
xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd"
xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd"
xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd"
xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd">
<d:ctr type="AUTOSAR" factory="autosar"
xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd"
xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd"
xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd"
xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd"
xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"
xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd">
<d:lst type="TOP-LEVEL-PACKAGES">
<d:ctr name="Lin" type="AR-PACKAGE">
<d:lst type="ELEMENTS">
<d:chc name="Lin" type="AR-ELEMENT" value="MODULE-CONFIGURATION">
<d:ctr type="MODULE-CONFIGURATION">
<a:a name="DEF" value="ASPath:/TS_T16D27M10I10R0/EcucDefs/Lin"/>
<a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
<d:var name="IMPLEMENTATION_CONFIG_VARIANT" type="ENUMERATION"
value="VariantPostBuild">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
<d:lst name="LinDemEventParameterRefs" type="MAP"/>
<d:ctr name="LinGeneral" type="IDENTIFIABLE">
<d:var name="LinDevErrorDetect" type="BOOLEAN" value="true"/>
<d:var name="LinMultiCoreErrorDetect" type="BOOLEAN"
value="true"/>
<d:var name="LinIndex" type="INTEGER" value="0">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
<d:var name="LinTimeoutDuration" type="INTEGER" value="300">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
</d:ctr>
</d:ctr>
</d:chc>
</d:lst>
</d:ctr>
</d:lst>
</d:ctr>
</datamodel>
I would like to find a simple parser in Python that lets me parse this file and remove some nodes without modifying any indentation, whitespace, etc. I have tried different solutions:
- xml.etree.ElementTree: messes up the namespaces
- lxml.etree: fine with the namespaces, but it reorders the attributes and changes the line length and indentation
- simple sed script: does not deal well with elements that have children, an explicit closing tag, etc.
Ideally, I would just need a very simple stream parser that detects the start and end of each element and lets me drop the ones I no longer want.
CodePudding user response:
Phew, that was not as easy as I expected it to be. I ended up using xml.parsers.expat, whose parser method GetInputContext() returns the raw, unparsed input starting at the current event.
Then, by catching the various events and implementing two states, forward and discard, I could copy the input to the output exactly as it appears in the original source file.
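To make the role of GetInputContext() more concrete, here is a minimal sketch, separate from the filter itself. It assumes the whole document is passed to the parser in one call; SAMPLE and collect_raw_start_tags are hypothetical names used only for this illustration. Inside a start-element handler, the returned bytes begin at the "<" of the current tag, so the original spacing between attributes is still there.

```python
import xml.parsers.expat

# Hypothetical sample with deliberately irregular spacing between attributes.
SAMPLE = b'<root>\n  <item  a="1"   b="2"/>\n</root>'


def collect_raw_start_tags(data):
    """Return the raw source bytes of every start tag, spacing included."""
    raw_tags = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # GetInputContext() is only meaningful while a handler is running;
        # outside of a handler it returns None.
        ctx = parser.GetInputContext()
        # Naive cut at the first ">" (it would break on ">" inside an
        # attribute value, the same limitation as the filter).
        raw_tags.append(ctx[: ctx.index(b">") + 1])

    parser.StartElementHandler = on_start
    parser.Parse(data, True)
    return raw_tags


raw_tags = collect_raw_start_tags(SAMPLE)
for tag in raw_tags:
    print(tag.decode("utf-8"))
```

The attrs dictionary passed to the handler has already lost the original spacing and quoting, which is why the filter copies from the input context instead of rebuilding the tag.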
I have added some goodies to:
- use the short form of a node (<x/>) when its content is entirely removed
- remove the leading indentation of a removed node
Known issue:
- if there is a ">" or "/>" inside the attribute values of an element, this will fail (see the TODO in the code below)
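One possible way around the known issue, as a sketch only and not part of the original answer: scan the input context character by character and ignore any ">" that sits inside a quoted attribute value. The helper names find_start_tag_end and start_tag_without_close are hypothetical, and the sketch assumes the bytes begin at the "<" of a well-formed start tag.

```python
def find_start_tag_end(ctx):
    """Return the index of the ">" that closes the start tag at ctx[0].

    A ">" inside a single- or double-quoted attribute value is skipped.
    """
    quote = None  # the quote character currently open, if any
    for i in range(len(ctx)):
        ch = ctx[i:i + 1]
        if quote is not None:
            if ch == quote:
                quote = None  # end of the attribute value
        elif ch in (b'"', b"'"):
            quote = ch  # start of an attribute value
        elif ch == b">":
            return i
    raise ValueError("unterminated start tag")


def start_tag_without_close(ctx):
    """Return the start tag without its closing ">" or "/>"."""
    end = find_start_tag_end(ctx)
    if ctx[end - 1:end] == b"/":
        end -= 1  # short form: also drop the "/"
    return ctx[:end]


tricky = b"<a x=\"1 > 0\" y='/>'/>\n<b/>"
print(start_tag_without_close(tricky).decode("utf-8"))
```

The result of start_tag_without_close could replace the brack1/brack2 lookup in the start-element handler, since that lookup also yields the tag text without its closing bracket.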
import io
import xml.parsers.expat


class xmlFilter:
    def __init__(self, fout, parser, encoding="iso-8859-1"):
        self.fout = fout
        self.parser = parser
        # True while a start tag has been written without its closing bracket
        self._pending_start_element = False
        self._depth_count = 0
        self._pending_character_data = b""
        self.switch_out_discard()

    def switch_out_discard(self):
        # forward state: copy events to the output
        self.parser.CommentHandler = self.comment
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.character_data

    def switch_in_discard(self):
        # discard state: swallow everything until the removed node is closed
        self.parser.CommentHandler = self.comment_discard
        self.parser.StartElementHandler = self.start_element_discard
        self.parser.EndElementHandler = self.end_element_discard
        self.parser.CharacterDataHandler = self.character_data_discard

    def handle_pending_start_at_start(self):
        if self._pending_start_element:
            self.fout.write(b">")
            self._pending_start_element = False
        self.fout.write(self._pending_character_data)
        self._pending_character_data = b""

    def comment_discard(self, data):
        pass

    def start_element_discard(self, name, attrs):
        # nested element inside the removed node
        self._depth_count += 1

    def end_element_discard(self, name):
        self._depth_count -= 1
        if self._depth_count == 0:
            self.switch_out_discard()

    def character_data_discard(self, data):
        pass

    def comment(self, data):
        self.handle_pending_start_at_start()
        # data already contains the spaces around the comment text
        self.fout.write(b"<!--" + bytes(data, "UTF-8") + b"-->")

    def start_element(self, name, attrs):
        if "name" in attrs and attrs["name"] == "IMPORTER_INFO":
            self._depth_count = 1
            self.switch_in_discard()
            # we are explicitly not handling the pending start to be able to
            # create a short form node if there is no data
            # remove the trailing spaces to remove the indentation
            self._pending_character_data = self._pending_character_data.rstrip()
        else:
            self.handle_pending_start_at_start()
            ctx = self.parser.GetInputContext()
            # look for the next xml closing element
            # TODO: support the > and /> that are inside attributes
            brack1 = ctx.index(b">")
            # it is possible that there is no short close
            try:
                brack2 = ctx.index(b"/>")
            except ValueError:
                brack2 = 99999999
            if brack1 < brack2:
                data = ctx[:brack1]
            else:
                data = ctx[:brack2]
            self._pending_start_element = True
            self.fout.write(data)

    def end_element(self, name):
        if self._pending_start_element:
            if len(self._pending_character_data.rstrip()) == 0:
                # nothing but whitespace left inside: use the short form
                self.fout.write(b"/>")
            else:
                self.fout.write(b">")
                self.fout.write(self._pending_character_data)
                self.fout.write(b"</" + bytes(name, "UTF-8") + b">")
            self._pending_start_element = False
        else:
            self.fout.write(self._pending_character_data)
            self.fout.write(b"</" + bytes(name, "UTF-8") + b">")
        self._pending_character_data = b""

    def character_data(self, data):
        # expat may deliver one text run in several chunks, so accumulate
        self._pending_character_data += bytes(data, "UTF-8")


# put the input XML (shown below under "Input is:") between the quotes
data = b'''
'''
fout = io.BytesIO()
# the XML declaration is not forwarded by the handlers, so write it by hand
fout.write(b"<?xml version='1.0'?>\n")
parser = xml.parsers.expat.ParserCreate()
xml_filter = xmlFilter(fout, parser)
# True marks this as the final (and only) chunk of input
parser.Parse(data, True)
print(fout.getvalue().decode("utf-8"))
Input is:
<?xml version='1.0'?>
<datamodel version="7.0"
xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd"
xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd"
xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd"
xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd">
<d:ctr type="AUTOSAR" factory="autosar"
xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd"
xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd"
xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd"
xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd"
xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"
xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd">
<d:lst type="TOP-LEVEL-PACKAGES">
<d:ctr name="Lin" type="AR-PACKAGE">
<d:lst type="ELEMENTS">
<d:chc name="Lin" type="AR-ELEMENT" value="MODULE-CONFIGURATION">
<d:ctr type="MODULE-CONFIGURATION">
<a:a name="DEF" value="ASPath:/TS_T16D27M10I10R0/EcucDefs/Lin"/>
<a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
<d:var name="IMPLEMENTATION_CONFIG_VARIANT" type="ENUMERATION"
value="VariantPostBuild">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
<d:lst name="LinDemEventParameterRefs" type="MAP"/>
<d:ctr name="LinGeneral" type="IDENTIFIABLE">
<d:var name="LinDevErrorDetect" type="BOOLEAN" value="true"/>
<d:var name="LinMultiCoreErrorDetect" type="BOOLEAN"
value="true"/>
<d:var name="LinIndex" type="INTEGER" value="0">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
<d:var name="LinTimeoutDuration" type="INTEGER" value="300">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
<d:var name="LinVersionInfoApi" type="BOOLEAN" value="false"/>
<d:var name="LinHwMcuTrigSleepEnable" type="BOOLEAN"
value="false"/>
<d:ref name="LinSysClockRef" type="REFERENCE"
value="ASPath:/Mcu/Mcu/McuModuleConfiguration/McuClockSettingConfig_0/McuClockReferencePointConfig"/>
<d:var name="LinCsrClksel" type="ENUMERATION" value="ASCLINF"/>
<d:var name="LinInitApiMode" type="ENUMERATION"
value="LIN_MCAL_SUPERVISOR">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
<d:var name="LinMasterInterruptEnable" type="BOOLEAN"
value="true">
<a:a name="IMPORTER_INFO" value="@DEF"/>
</d:var>
</d:ctr>
<d:ctr name="LinGlobalConfig" type="IDENTIFIABLE">
<a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
<d:lst name="LinChannel" type="MAP">
<a:a name="IMPORTER_INFO" value="ImportEcuConfig"/>
<d:ctr name="LIN_A1" type="IDENTIFIABLE">
</d:ctr>
</d:lst>
</d:ctr>
</d:ctr>
</d:chc>
</d:lst>
</d:ctr>
</d:lst>
</d:ctr>
</datamodel>
Result is:
<?xml version='1.0'?>
<datamodel version="7.0"
xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd"
xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd"
xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd"
xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd">
<d:ctr type="AUTOSAR" factory="autosar"
xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd"
xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd"
xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd"
xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd"
xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"
xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd">
<d:lst type="TOP-LEVEL-PACKAGES">
<d:ctr name="Lin" type="AR-PACKAGE">
<d:lst type="ELEMENTS">
<d:chc name="Lin" type="AR-ELEMENT" value="MODULE-CONFIGURATION">
<d:ctr type="MODULE-CONFIGURATION">
<a:a name="DEF" value="ASPath:/TS_T16D27M10I10R0/EcucDefs/Lin"/>
<d:var name="IMPLEMENTATION_CONFIG_VARIANT" type="ENUMERATION"
value="VariantPostBuild"/>
<d:lst name="LinDemEventParameterRefs" type="MAP"/>
<d:ctr name="LinGeneral" type="IDENTIFIABLE">
<d:var name="LinDevErrorDetect" type="BOOLEAN" value="true"/>
<d:var name="LinMultiCoreErrorDetect" type="BOOLEAN"
value="true"/>
<d:var name="LinIndex" type="INTEGER" value="0"/>
<d:var name="LinTimeoutDuration" type="INTEGER" value="300"/>
<d:var name="LinVersionInfoApi" type="BOOLEAN" value="false"/>
<d:var name="LinHwMcuTrigSleepEnable" type="BOOLEAN"
value="false"/>
<d:ref name="LinSysClockRef" type="REFERENCE"
value="ASPath:/Mcu/Mcu/McuModuleConfiguration/McuClockSettingConfig_0/McuClockReferencePointConfig"/>
<d:var name="LinCsrClksel" type="ENUMERATION" value="ASCLINF"/>
<d:var name="LinInitApiMode" type="ENUMERATION"
value="LIN_MCAL_SUPERVISOR"/>
<d:var name="LinMasterInterruptEnable" type="BOOLEAN"
value="true"/>
</d:ctr>
<d:ctr name="LinGlobalConfig" type="IDENTIFIABLE">
<d:lst name="LinChannel" type="MAP">
<d:ctr name="LIN_A1" type="IDENTIFIABLE"/>
</d:lst>
</d:ctr>
</d:ctr>
</d:chc>
</d:lst>
</d:ctr>
</d:lst>
</d:ctr>
</datamodel>
CodePudding user response:
Consider XSLT, the special-purpose language designed to transform XML files. Python can run XSLT 1.0 scripts with lxml, but XSLT is not limited to Python: it can be run by any standalone processor or from other programming languages.
In particular, with the well-known identity transform template you can copy the XML as is and then use an empty template to remove the unwanted nodes. The script below removes all <a:a> nodes. As noted above and as the demo below shows, processors may change spacing/indentation (such as indented attributes or attribute order), but the content should be intact.
XSLT (save as .xsl, a special .xml file)
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.tresos.de/_projects/DataModel2/16/root.xsd"
                xmlns:a="http://www.tresos.de/_projects/DataModel2/16/attribute.xsd"
                xmlns:v="http://www.tresos.de/_projects/DataModel2/06/schema.xsd"
                xmlns:d="http://www.tresos.de/_projects/DataModel2/06/data.xsd"
                xmlns:ad="http://www.tresos.de/_projects/DataModel2/08/admindata.xsd"
                xmlns:cd="http://www.tresos.de/_projects/DataModel2/08/customdata.xsd"
                xmlns:f="http://www.tresos.de/_projects/DataModel2/14/formulaexpr.xsd"
                xmlns:icc="http://www.tresos.de/_projects/DataModel2/08/implconfigclass.xsd"
                xmlns:mt="http://www.tresos.de/_projects/DataModel2/11/multitest.xsd"
                xmlns:variant="http://www.tresos.de/_projects/DataModel2/11/variant.xsd"
                version="1.0">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE NODES -->
    <xsl:template match="a:a"/>
</xsl:stylesheet>
Python
import lxml.etree as lx
# PARSE XML AND XSLT
doc = lx.parse("input.xml")
style = lx.parse("style.xsl")
# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)
# PRINT TO SCREEN
print(result)
# OUTPUT TO FILE
result.write_output("output.xml")
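If only some of the <a:a> nodes should go, the empty template can carry a predicate. The following is a small self-contained sketch under a few assumptions: the stylesheet is kept inline instead of in a .xsl file, SAMPLE is a made-up two-node document, and remove_importer_info is a hypothetical helper name. It removes only the nodes whose name attribute is IMPORTER_INFO and keeps the others, such as DEF.

```python
import lxml.etree as lx

A_NS = "http://www.tresos.de/_projects/DataModel2/16/attribute.xsd"

# Identity transform plus an empty template restricted by a predicate.
STYLE = f"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:a="{A_NS}">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="a:a[@name='IMPORTER_INFO']"/>
</xsl:stylesheet>
"""

# Made-up input for illustration only.
SAMPLE = f"""\
<root xmlns:a="{A_NS}">
  <a:a name="DEF" value="keep"/>
  <a:a name="IMPORTER_INFO" value="drop"/>
</root>
"""


def remove_importer_info(xml_bytes):
    """Apply the stylesheet and return the transformed result tree."""
    transformer = lx.XSLT(lx.fromstring(STYLE.encode("utf-8")))
    return transformer(lx.fromstring(xml_bytes))


result = remove_importer_info(SAMPLE.encode("utf-8"))
# Names of the <a:a> nodes that survived the transform.
names = result.xpath("//a:a/@name", namespaces={"a": A_NS})
print(names)
```

As with the full script, this keeps the content but does not promise byte-for-byte identical formatting of the rest of the file.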