How to retain custom entities like &ndash; while transforming the xml using xslt

I'm trying to transform XML using XSLT. Here is the example I'm using:

Input xml:

<!DOCTYPE printArtifactGroup [<!ENTITY ndash "&#38;#38;ndash;">]>
<group>
   <begin>
      <head>
         <text>(VOLS 0200)</text>
      </head>
      <data>
         <text>Health 161&ndash;1 to 16&ndash;32&ndash;End 2006</text>
      </data>
   </begin>
</group>

xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <xsl:output encoding="utf8" method="xml" indent="yes" />


   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

   <xsl:template match="begin">
      <xsl:copy-of select="data" />
   </xsl:template>

</xsl:stylesheet>

python code to run xslt transformation:

from lxml import etree

# Load the DTD and expand entity references while parsing
parser = etree.XMLParser(load_dtd=True, resolve_entities=True, huge_tree=True)

dom = etree.parse('test.xml', parser)
xslt = etree.parse('test.xsl')
transform = etree.XSLT(xslt)
newdom = transform(dom)

# Serialize the transformation result and write it to disk
processed_file = 'processed.xml'
with open(processed_file, 'w') as file:
    file.write(str(newdom))
print(newdom)
print('Task Done')

Actual output:

<?xml version="1.0"?>
<group>
   <data>
         <text>Health 161&amp;ndash;1 to 16&amp;ndash;32&amp;ndash;End 2006</text>
      </data>
</group>

expected output:

<group>
   <begin>
      <data>
         <text>Health 161&ndash;1 to 16&ndash;32&ndash;End 2006</text>
      </data>
   </begin>
</group>

The XML parser resolves the &amp; (ampersand) entity to &, so a custom entity such as &ndash; comes out as &amp;ndash; in the result. This is the default behavior, but we have a huge XML file, and comparing the transformed XML with the source is difficult when the entities have changed.
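The escaping described above can be reproduced in a few lines. This is a minimal sketch that swaps in the standard-library xml.etree.ElementTree for lxml (an assumption made only to keep the example dependency-free) and uses a shortened document: the entity expands to the literal text &ndash;, and the serializer then escapes its ampersand.

```python
import xml.etree.ElementTree as ET

# Shortened version of the input document, with the same entity declaration.
DOC = (
    '<!DOCTYPE text [<!ENTITY ndash "&#38;#38;ndash;">]>'
    '<text>Health 161&ndash;1</text>'
)

elem = ET.fromstring(DOC)

# After parsing, the entity reference is gone; the text holds a literal "&ndash;".
parsed_text = elem.text
print(parsed_text)

# On output the serializer must escape the bare ampersand.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)
```

Once the parser has expanded the reference, the tree no longer records that an entity was ever there, which is why the serializer cannot restore it on its own.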

Is there any way we can generate the expected output while retaining the original entities?

Thanks in advance; any ideas or suggestions are really appreciated.

CodePudding user response：

Here's a trick: Replace the internal entity definition with

[<!ENTITY ndash "<ndash/>">]

and replace the match="begin" template with two templates:

<xsl:template match="begin">
  <xsl:apply-templates select="data" />
</xsl:template>
<xsl:template match="ndash">
  <xsl:text disable-output-escaping="yes">&amp;ndash;</xsl:text>
</xsl:template>

During parsing of the source XML, this replaces each &ndash; reference (a named character entity in HTML, but not predefined in XML) with an XML element <ndash/>. During XSLT processing, that element is in turn replaced with the character sequence &ndash;. The disable-output-escaping attribute prevents the XSLT processor from writing it out as &amp;ndash;.
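To try the trick end to end from Python, something like the following sketch may help. It assumes lxml is installed, inlines a compacted copy of the source and stylesheet as byte strings instead of reading test.xml and test.xsl, and declares version="1.0" because lxml's XSLT support is built on libxslt; the names SOURCE and STYLESHEET are made up for this illustration.

```python
from lxml import etree

# Source document with the entity redefined to expand to an <ndash/> element.
SOURCE = b"""<!DOCTYPE group [<!ENTITY ndash "<ndash/>">]>
<group><begin><head><text>(VOLS 0200)</text></head><data><text>Health 161&ndash;1 to 16&ndash;32&ndash;End 2006</text></data></begin></group>"""

# Identity template plus the two templates from the answer.
STYLESHEET = b"""<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="begin">
    <xsl:apply-templates select="data"/>
  </xsl:template>
  <xsl:template match="ndash">
    <xsl:text disable-output-escaping="yes">&amp;ndash;</xsl:text>
  </xsl:template>
</xsl:stylesheet>"""

# Expand entity references while parsing so each &ndash; becomes an <ndash/> node.
parser = etree.XMLParser(resolve_entities=True)
source_root = etree.fromstring(SOURCE, parser)

transform = etree.XSLT(etree.fromstring(STYLESHEET))

# str() serializes the result tree through the XSLT output rules.
result = str(transform(source_root))
print(result)
```

Note that disable-output-escaping only takes effect when the result is serialized; the result tree itself still contains ordinary text nodes.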

CodePudding user response：

The (modern) XSLT way (XSLT 2 and later, available to Python through the Python API of SaxonC) would be to use

<!DOCTYPE printArtifactGroup [<!ENTITY ndash "&#8211;">]>
<group>
   <begin>
      <head>
         <text>(VOLS 0200)</text>
      </head>
      <data>
         <text>Health 161&ndash;1 to 16&ndash;32&ndash;End 2006</text>
      </data>
   </begin>
</group>

and then an xsl:character-map e.g.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all">

    <xsl:character-map name="characters-to-entities">
      <xsl:output-character character="&#8211;" string="&amp;ndash;"/> 
    </xsl:character-map>
        
    <xsl:output use-character-maps="characters-to-entities"/>
    
    <xsl:mode on-no-match="shallow-copy"/>

</xsl:stylesheet>

That way the text element is output as e.g. <text>Health 161&ndash;1 to 16&ndash;32&ndash;End 2006</text>.

(Note: I have solely presented the entity/character map issue; I have not tried to implement the other part of your transformation in that sample.)
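Where only an XSLT 1.0 processor is at hand, the effect of the character map can be roughly imitated as a post-processing step on the serialized output. The sketch below is plain Python and not part of any XSLT API: the names CHARACTERS_TO_ENTITIES and apply_character_map are hypothetical, and it assumes the mapped character never occurs in markup such as element or attribute names.

```python
# Hypothetical stand-in for xsl:character-map: code point -> replacement string.
CHARACTERS_TO_ENTITIES = {0x2013: "&ndash;"}


def apply_character_map(serialized: str, mapping=CHARACTERS_TO_ENTITIES) -> str:
    """Replace each mapped character in already-serialized XML with its string."""
    return serialized.translate(mapping)


# Serialized output in which the entity was resolved to the en dash (U+2013).
serialized = "<text>Health 161\u20131 to 16\u201332\u2013End 2006</text>"

restored = apply_character_map(serialized)
print(restored)
```

As with the character map itself, the restored references are only meaningful to a consumer that declares the ndash entity, for example through the original DOCTYPE.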