XSLT 2 - convert single characters to processing-instruction

I have a text node that contains 7-bit ASCII text as well as higher Unicode characters (e.g. x2011, xF0B7, x25CF ...).

I need to be able to (efficiently) convert these single high-Unicode characters into processing instructions,

e.g.

&#x2011;  ->   <processing-instruction name="xxx">character output="hyphen"</pro...>
&#xF0B7;  ->   <processing-instruction name="xxx">character output="page"</pro...>

I've tried using a named tokenize template, which does split the text before/after the first token delimiter (e.g. x2011), but I end up with a variable containing 'text...<processing-instruction>...</processing-instruction>...text', which trips up the next tokenize call.

I managed to get the following approach to work, but it looks really inelegant. I'm sure there's a more efficient way to do this, but I haven't found anything that works better.

The first character replacement is easy, using replace(), as I'm only escaping the % (the target software uses the '%' for other things so needs to be escaped in this manner).

And yes, this would work for the x2011-to-< ... >, but the original intention was to convert to processing-instructions directly.

    <xsl:template match="text()">
        <xsl:variable name="SR1">
            <xsl:value-of select="fn:replace(., '%', '\\%')"/>
        </xsl:variable>
        <!-- unbreakable hyphen -->
        <xsl:variable name="SR2">
            <xsl:call-template name="tokenize">
                <xsl:with-param name="string" select="$SR1"/>
                <xsl:with-param name="delimiter">&#x2011;</xsl:with-param>
                <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="bxhyphen" hex="x2011" data="E28091"?&gt;</xsl:text></xsl:with-param>
            </xsl:call-template>
        </xsl:variable>
        <!-- page ref -->
        <xsl:variable name="SR3">
            <xsl:call-template name="tokenize">
                <xsl:with-param name="string" ><xsl:copy-of select="$SR2"/></xsl:with-param>
                <xsl:with-param name="delimiter">&#xF0B7;</xsl:with-param>
                <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="pgref" hex="xF0B7" data="EF82B7"?&gt;</xsl:text>
                </xsl:with-param>
            </xsl:call-template>
        </xsl:variable>
        <!-- bullet -->
        <xsl:variable name="SR4">
            <xsl:call-template name="tokenize">
                <xsl:with-param name="string" ><xsl:copy-of select="$SR3"/></xsl:with-param>
                <xsl:with-param name="delimiter">&#x25CF;</xsl:with-param>
                <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="bub" hex="x25CF" data="E2978F"?&gt;</xsl:text>
                </xsl:with-param>
            </xsl:call-template>
        </xsl:variable>
        <xsl:copy-of select="$SR4"/>
    </xsl:template>

Ideally, I was aiming to have a list of 'pairs', the hex unicode and its matching processing-instruction, but any better solution would be appreciated!

Another useful feature would be to flag characters that have not been processed, i.e. any characters in the ranges x00-x1F, xFF (excluding x2011, x25CF, xF0B7).

CodePudding user response：

A version without xsl:analyze-string is the following. It uses a separate file to store the codepoint/string relations.

So, in this example, a file called codes.xml contains the mapping (the hex values have to be converted to decimal first; here this is already done):

<CharKey>
    <Char cp="8209"  string="hyphen" />
    <Char cp="61623" string="page" />
</CharKey>

And the stylesheet (here it is XSLT 3.0, but it also works with XSLT 2.0 after some minor modifications) iterates over the codepoints of the string:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>    
    <xsl:variable name="codes" select="document('codes.xml')/CharKey" />
    <!-- the Text element holding the text node is matched here -->
    <xsl:template match="/Record/Text">
        <xsl:variable name="cps" select="string-to-codepoints(.)" />
        <xsl:for-each select="$cps">
            <xsl:variable name="curCP" select="$codes/Char[@cp=current()]" />
            <xsl:choose>
                <xsl:when test="$curCP"><xsl:processing-instruction name="xxx" expand-text="yes">character output="{$curCP/@string}"</xsl:processing-instruction></xsl:when>
                <xsl:otherwise><xsl:value-of select="codepoints-to-string(.)" /></xsl:otherwise>
            </xsl:choose>                
        </xsl:for-each>
    </xsl:template>
    
</xsl:stylesheet>

This could be further simplified, but as an example it should work.
For the sample input

<Record>
    <Text>Hello&#x2011;&#xF0B7;End</Text>
</Record>

the output is

Hello<?xxx character output="hyphen"?><?xxx character output="page"?>End
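
The stylesheet above needs XSLT 3.0 mainly because of expand-text. What follows is a minimal sketch of one possible XSLT 2.0 adaptation, assuming the same codes.xml file. The second xsl:when, which reports unmapped code points above 255 through xsl:message, is an illustrative assumption prompted by the question's wish to flag unprocessed characters; it is not part of the original answer, and the threshold is only a guess at the intended range.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- Same mapping file as in the answer above -->
    <xsl:variable name="codes" select="document('codes.xml')/CharKey"/>

    <xsl:template match="/Record/Text">
        <xsl:for-each select="string-to-codepoints(.)">
            <!-- Inside the predicate, current() is the code point being iterated -->
            <xsl:variable name="curCP" select="$codes/Char[@cp = current()]"/>
            <xsl:choose>
                <xsl:when test="$curCP">
                    <!-- XSLT 2.0 has no expand-text, so the PI data is built with concat() -->
                    <xsl:processing-instruction name="xxx"
                        select="concat('character output=&quot;', $curCP/@string, '&quot;')"/>
                </xsl:when>
                <!-- Assumption: any unmapped code point above 255 should be reported -->
                <xsl:when test=". gt 255">
                    <xsl:message>Unmapped character, code point <xsl:value-of select="."/></xsl:message>
                    <xsl:value-of select="codepoints-to-string(.)"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="codepoints-to-string(.)"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```

For the sample input this should emit the same two processing instructions as the 3.0 version. The flagged characters are still copied to the output, so the message acts only as a warning.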

CodePudding user response：

If the characters you are looking for are known and limited, I would list them in a character class, e.g.

<xsl:template match="text()">
    <xsl:analyze-string select="." regex="[‑●]">
        <xsl:matching-substring>
            <xsl:processing-instruction name="xpp" select="mf:map(.)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

where mf:map is a function you set up that maps each character to the string you want to output as the data of the PI.

In XSLT 3 I would probably store the character-to-name mapping in an XPath/XSLT map; in XSLT 2 you can use an xsl:param or xsl:variable, e.g.

<xsl:param name="characters-to-name"><map char="‑">bxhyphen</map>...</xsl:param>

and select into that, if needed even by setting up a key.
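
Put together, the XSLT 2 variant described above might look like the following sketch. The mf namespace URI, the key name char-map, and the regex variable are hypothetical names chosen for illustration. The symbol names come from the question, and the PI data is simplified to a single symbol pseudo-attribute.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs mf">

    <!-- Character-to-name pairs; extend this list as needed -->
    <xsl:variable name="characters-to-name">
        <map char="&#x2011;">bxhyphen</map>
        <map char="&#xF0B7;">pgref</map>
        <map char="&#x25CF;">bub</map>
    </xsl:variable>

    <!-- Hypothetical key for looking up a map element by its character -->
    <xsl:key name="char-map" match="map" use="@char"/>

    <!-- Character class containing every mapped character -->
    <xsl:variable name="regex"
        select="concat('[', string-join($characters-to-name/map/@char, ''), ']')"/>

    <!-- Returns the name for a character, or an empty string if it is unmapped -->
    <xsl:function name="mf:map" as="xs:string">
        <xsl:param name="char" as="xs:string"/>
        <xsl:sequence select="string(key('char-map', $char, $characters-to-name))"/>
    </xsl:function>

    <xsl:template match="text()">
        <xsl:analyze-string select="." regex="{$regex}">
            <xsl:matching-substring>
                <xsl:processing-instruction name="xpp"
                    select="concat('character symbol=&quot;', mf:map(.), '&quot;')"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>
```

Building the regular expression from the mapping keeps the list of pairs in one place, so adding a character only requires a new map element. With no other templates present, the built-in rules drop the element markup, so in a real stylesheet this template would sit alongside whatever rules copy or transform the elements.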