I have a text node that contains 7-bit ASCII text as well as higher Unicode characters (e.g. x2011, xF0B7, x25CF ...).
I need to be able to (efficiently) convert these single high-Unicode characters into processing instructions, e.g.
‑ (x2011) -> <processing-instruction name="xxx">character output="hyphen"</pro...>
&#xF0B7; (xF0B7) -> <processing-instruction name="xxx">character output="page"</pro...>
I've tried using xsl:tokenize, which does split the text before/after the first token delimiter (e.g. x2011), but I end up with a variable containing 'text...<processing-instruction>...</processing-instruction>...text', which trips up the next xsl:tokenize.
I managed to get the following approach to work but it looks really inelegant, and I'm sure there's a more efficient/better way to do this but I haven't found anything that works or is any better.
The first character replacement is easy, using replace(), as I'm only escaping the % (the target software uses the '%' for other things, so it needs to be escaped in this manner).
And yes, this would work for the x2011-to-< ... >, but the original intention was to convert to processing-instructions directly.
<xsl:template match="text()">
  <!-- escape % for the target software -->
  <xsl:variable name="SR1">
    <xsl:value-of select="fn:replace(., '%', '\\%')"/>
  </xsl:variable>
  <!-- unbreakable hyphen -->
  <xsl:variable name="SR2">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="string" select="$SR1"/>
      <xsl:with-param name="delimiter">‑</xsl:with-param>
      <xsl:with-param name="PI"><xsl:text><?xpp character symbol="bxhyphen" hex="x2011" data="E28091"?></xsl:text></xsl:with-param>
    </xsl:call-template>
  </xsl:variable>
  <!-- page ref -->
  <xsl:variable name="SR3">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="string"><xsl:copy-of select="$SR2"/></xsl:with-param>
      <xsl:with-param name="delimiter">&#xF0B7;</xsl:with-param>
      <xsl:with-param name="PI"><xsl:text><?xpp character symbol="pgref" hex="xF0B7" data="EF82B7"?></xsl:text></xsl:with-param>
    </xsl:call-template>
  </xsl:variable>
  <!-- bullet -->
  <xsl:variable name="SR4">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="string"><xsl:copy-of select="$SR3"/></xsl:with-param>
      <xsl:with-param name="delimiter">●</xsl:with-param>
      <xsl:with-param name="PI"><xsl:text><?xpp character symbol="bub" hex="x25CF" data="E2978F"?></xsl:text></xsl:with-param>
    </xsl:call-template>
  </xsl:variable>
  <xsl:copy-of select="$SR4"/>
</xsl:template>
Ideally, I was aiming to have a list of 'pairs', the hex unicode and its matching processing-instruction, but any better solution would be appreciated!
Another feature would be to flag characters that have not been processed, so any characters in the ranges x00-x1F, xFF (excluding x2011, x25CF xF0B7).
CodePudding user response:
A version without xsl:analyze-string is the following. It uses a separate file to store the codepoint/string relations.
So, in this example, a file called codes.xml contains the mapping (the hex values have to be converted to decimal first; here this is already done):
<CharKey>
  <Char cp="8209" string="hyphen" />
  <Char cp="61623" string="page" />
</CharKey>
And the stylesheet (here XSLT 3.0, but it also works with XSLT 2.0 after some minor modifications) iterates over the codepoints of the string:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:variable name="codes" select="document('codes.xml')/CharKey" />
  <!-- the Text element is matched here; its string value is split into codepoints -->
  <xsl:template match="/Record/Text">
    <xsl:variable name="cps" select="string-to-codepoints(.)" />
    <xsl:for-each select="$cps">
      <!-- look up the current codepoint in the mapping file -->
      <xsl:variable name="curCP" select="$codes/Char[@cp=current()]" />
      <xsl:choose>
        <xsl:when test="$curCP"><xsl:processing-instruction name="xxx" expand-text="yes">character output="{$curCP/@string}"</xsl:processing-instruction></xsl:when>
        <xsl:otherwise><xsl:value-of select="codepoints-to-string(.)" /></xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
This could be further simplified, but as an example it should work.
For the sample input
<Record>
  <Text>Hello‑&#xF0B7;End</Text>
</Record>
the output is
Hello<?xxx character output="hyphen"?><?xxx character output="page"?>End
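The "minor modifications" for XSLT 2.0 are not spelled out above. A minimal sketch of what they might look like, assuming the same codes.xml and the same /Record/Text input: the main change is that expand-text is an XSLT 3.0 feature, so the PI data is built with the select attribute of xsl:processing-instruction instead. The variable names cp and hit are illustrative only, and the sketch assumes each codepoint appears at most once in codes.xml.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- same mapping file as in the XSLT 3.0 version -->
  <xsl:variable name="codes" select="document('codes.xml')/CharKey"/>

  <xsl:template match="/Record/Text">
    <xsl:for-each select="string-to-codepoints(.)">
      <!-- inside for-each the context item is the integer codepoint -->
      <xsl:variable name="cp" select="."/>
      <!-- assumption: at most one Char entry per codepoint -->
      <xsl:variable name="hit" select="$codes/Char[@cp = $cp]"/>
      <xsl:choose>
        <xsl:when test="$hit">
          <!-- no expand-text in 2.0: build the PI data with concat() -->
          <xsl:processing-instruction name="xxx"
              select="concat('character output=&quot;', $hit/@string, '&quot;')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="codepoints-to-string($cp)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For long text nodes this emits one small text node per unmatched character, so a regex-based split may scale better.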
CodePudding user response:
If the characters you are looking for are known and limited, I would list them in a character class, e.g.
<xsl:template match="text()">
  <xsl:analyze-string select="." regex="[‑&#xF0B7;●]">
    <xsl:matching-substring>
      <xsl:processing-instruction name="xxp" select="mf:map(.)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
where mf:map is a function you set up that maps each character to the string you want to output as the data of the PI. In XSLT 3 I would probably store the character-to-name mapping in an XPath/XSLT map; in XSLT 2 you can use an xsl:param or xsl:variable, e.g.
<xsl:param name="characters-to-name"><map char="‑">bxhyphen</map>...</xsl:param>
and select into that, if needed even by setting up a key.
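The XSLT 3 map idea could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's actual code: the namespace URI bound to mf, the variable name pi-data, and the PI name xpp (taken from the question's code) are all placeholders, and the PI data strings are copied from the question. The regex is built from the map keys, so the character list only has to be maintained in one place.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="#all">

  <!-- copy everything that is not handled explicitly -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- character -> PI data; entries taken from the question -->
  <xsl:variable name="pi-data" as="map(xs:string, xs:string)" select="map {
      '&#x2011;': 'character symbol=&quot;bxhyphen&quot; hex=&quot;x2011&quot; data=&quot;E28091&quot;',
      '&#xF0B7;': 'character symbol=&quot;pgref&quot; hex=&quot;xF0B7&quot; data=&quot;EF82B7&quot;',
      '&#x25CF;': 'character symbol=&quot;bub&quot; hex=&quot;x25CF&quot; data=&quot;E2978F&quot;'
    }"/>

  <!-- hypothetical helper: returns the PI data for one character,
       or the empty sequence if the character is not in the map -->
  <xsl:function name="mf:map" as="xs:string?">
    <xsl:param name="char" as="xs:string"/>
    <xsl:sequence select="$pi-data($char)"/>
  </xsl:function>

  <xsl:template match="text()">
    <!-- regex is an attribute value template: the character class is
         assembled from the map keys (assumes no key is a character
         that is special inside a class, such as ], \, ^ or -) -->
    <xsl:analyze-string select="."
        regex="[{string-join(map:keys($pi-data))}]">
      <xsl:matching-substring>
        <xsl:processing-instruction name="xpp" select="mf:map(.)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Adding another character then only means adding one map entry. The % escaping from the question is left out here and would still need to be applied to the non-matching substrings.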