Insert tag in XML text for mixed words

I want to annotate text in an XML document using an index (glossary) of named entities:

glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']

I found a solution on Stack Overflow (by @Brad Clements) that partially annotates the XML, in the question "insert tags in ElementTree text".

For this XML: <TEI><teiHeader></teiHeader><body>Paris est la capitale de France et Vincennes n'est pas loin. Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</body></TEI>

It works for single-word entries like "Paris", "Vincennes" or "France", but not for the multi-word entries in the index ("Guy de Maupassant", "Maurice Ravel").

My current output is : <?xml version="1.0"?> <TEI><teiHeader/><body><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à <entity>Paris.</entity> </body></TEI>

Expected output: <?xml version="1.0"?> <TEI><teiHeader/><body><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. <entity>Guy de Maupassant</entity> et <entity>Maurice Ravel</entity> inaugurent une nouvelle statue à <entity>Paris.</entity> </body></TEI>

The code I tried to adapt (taken from insert tags in ElementTree text):

from lxml import etree
import string
import itertools

stylesheet = etree.XML("""
    <xsl:stylesheet version="1.0"
         xmlns:btest="uri:bolder"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <xsl:template match="@*">
            <xsl:copy />
        </xsl:template>

        <xsl:template match="*">
            <xsl:element name="{name(.)}">
                <xsl:copy-of select="@*" />
                <xsl:apply-templates select="text()" />
                <xsl:apply-templates select="./*" />
            </xsl:element>
        </xsl:template>

        <xsl:template match="text()">
            <xsl:copy-of select="btest:bolder(.)/node()" />
        </xsl:template>         
     </xsl:stylesheet>
""")


glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']


def bolder(context, s):
    results = []
    r = None
    # - Iteration over the index and the words contained in the sequence
    for gw, word in itertools.zip_longest(glossary, s[0].split()):
        # - Little preprocessing for words with punctuation
        word_clean = word.translate(str.maketrans('', '', string.punctuation))
        # 1) If word in sequence directly match with glossary (index) add tag
        if word_clean in glossary:
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'entity')
            b.text = word
            b.tail = ' '
            results.append(r)
            r = None

        # 2) Otherwise take the word from the index and check that it is contained in the sequence
        # (if the word is composed, e.g. first name + last name,
        # and it is not None)
        elif gw is not None and gw in s[0] and len(gw.split()) > 1:
            # repeat the process to annotate
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'entity')
            b.text = gw
            b.tail = ' '
            results.append(r)
            r = None

        # 3) if none of the prerequisites, add text to output with no tag
        else:
            if r is None:
                r = etree.Element('r')
            r.text = '%s%s ' % (r.text or '', word)

        if r is not None:
            results.append(r)

    return results

def test():
    ns = etree.FunctionNamespace('uri:bolder') 
    ns['bolder'] = bolder 
    transform = etree.XSLT(stylesheet)
    new = str(transform(etree.XML("""<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>""")))
    print(new)
    
if __name__ == "__main__":
    test()

But the output is still wrong (repetitions and omissions):

<?xml version="1.0"?>
<TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p>Guy de Maupassant <entity>Guy de Maupassant</entity> <entity>Maurice Ravel</entity> Ravel inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>

How can I improve the solution above so that it correctly matches multi-word entries that form a single entity? Thanks in advance.

CodePudding user response：

Here is an example of how you could do it all within your XSLT stylesheet (assuming the libxslt processor):

XSLT 1.0 + EXSLT str:tokenize()

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:str="http://exslt.org/strings"
extension-element-prefixes="str">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>

<xsl:param name="glossary">Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel</xsl:param>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="p/text()">
    <xsl:call-template name="tag-terms">
        <xsl:with-param name="string" select="."/>
        <xsl:with-param name="terms" select="str:tokenize($glossary, '|')"/>
    </xsl:call-template>
</xsl:template>

<xsl:template name="tag-terms">
    <xsl:param name="string"/>
    <xsl:param name="terms"/>
    <xsl:choose>
        <xsl:when test="$terms">
            <xsl:variable name="term" select="$terms[1]" />
            <xsl:choose>
                <xsl:when test="contains($string, $term)">
                    <!-- process substring-before with the remaining terms -->
                    <xsl:call-template name="tag-terms">
                        <xsl:with-param name="string" select="substring-before($string, $term)"/>
                        <xsl:with-param name="terms" select="$terms[position() > 1]"/>
                    </xsl:call-template>
                    <!-- matched term -->
                    <entity>
                        <xsl:value-of select="$term"/>
                    </entity>
                    <!-- continue with substring-after -->
                    <xsl:call-template name="tag-terms">
                        <xsl:with-param name="string" select="substring-after($string, $term)"/>
                        <xsl:with-param name="terms" select="$terms"/>
                    </xsl:call-template>
                </xsl:when>
                <xsl:otherwise>
                    <!-- pass the entire string for processing with the remaining terms -->
                    <xsl:call-template name="tag-terms">
                        <xsl:with-param name="string" select="$string"/>
                        <xsl:with-param name="terms" select="$terms[position() > 1]"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$string"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

</xsl:stylesheet>

Applied to your input example, this will return:

Result

<?xml version="1.0" encoding="utf-8"?>
<TEI>
  <teiHeader/>
  <body>
    <p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin.</p>
    <p> <entity>Guy de Maupassant</entity> et <entity>Maurice Ravel</entity> inaugurent une nouvelle statue à <entity>Paris</entity>.</p>
  </body>
</TEI>

The glossary can be passed to the stylesheet at runtime as a delimited string.
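
To make the recursion in the tag-terms template easier to follow, here is a rough Python sketch of the same logic. It is only an illustration and not part of the stylesheet: the function name tag_terms and the (text, is_entity) pair representation are assumptions made for this sketch, and the delimited glossary is split on '|' much as str:tokenize() does above.

```python
def tag_terms(string, terms):
    """Return a list of (text, is_entity) pairs, mirroring the tag-terms template."""
    if not terms:
        # No terms left to try: emit the string unchanged (skip empty strings).
        return [(string, False)] if string else []
    term, rest = terms[0], terms[1:]
    if term in string:
        before, _, after = string.partition(term)
        return (
            tag_terms(before, rest)     # text before the match: remaining terms only
            + [(term, True)]            # the matched term itself
            + tag_terms(after, terms)   # text after the match: same term list again
        )
    # Current term not found: try the remaining terms on the whole string.
    return tag_terms(string, rest)


glossary = "Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel"
text = " Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris."
segments = tag_terms(text, glossary.split("|"))
entities = [chunk for chunk, is_entity in segments if is_entity]
```

Each term is matched as a plain substring, so the order of the glossary decides which term wins when entries overlap.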

CodePudding user response：

In XSLT 3 you could use analyze-string:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">
    
  <xsl:param name="entity-strings" as="xs:string*" select="'Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel'"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="text()">
      <xsl:apply-templates select="analyze-string(., string-join($entity-strings, '|'))" mode="entity"/>
  </xsl:template>
  
  <xsl:template match="fn:match" mode="entity">
      <entity>{.}</entity>
  </xsl:template>
  
</xsl:stylesheet>

Saxon-C (any edition) from Saxonica has a Python API so it can be used with Python 3.

Alternatively, register a Python extension function with lxml:

from lxml import etree as ET
import re


stylesheet = ET.XML("""
    <xsl:stylesheet version="1.0"
         xmlns:btest="uri:bolder"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <xsl:template match="@* | node()">
          <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
        </xsl:template>

        <xsl:template match="text()">
            <xsl:copy-of select="btest:bolder(.)/node()" />
        </xsl:template>         
     </xsl:stylesheet>
""")

def bolder(context, s):
    results = []
    # The capturing group makes re.split keep the matched terms:
    # even positions hold plain text, odd positions hold entities.
    splits = re.split(r'(Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel)', s[0])
    for p, split in enumerate(splits):
        # Wrapper element; the stylesheet copies only its child nodes.
        el = ET.Element("element")
        if p % 2 == 0:
            el.text = split
        else:
            entity = ET.SubElement(el, "entity")
            entity.text = split
        results.append(el)
    return results
      
ns = ET.FunctionNamespace('uri:bolder') 
ns['bolder'] = bolder 
transform = ET.XSLT(stylesheet)
new = str(transform(ET.XML("""<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>""")))
print(new)
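
The alternation in the extension function above is hard-coded. A minimal sketch of building it from the glossary list instead could look like this; the helper name build_pattern is hypothetical, and the longest-first ordering and the \b word boundaries are assumptions that go beyond the original answer.

```python
import re

glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']


def build_pattern(terms):
    """Compile one regex that matches any glossary term as a whole word."""
    # Longest terms first, so 'Guy de Maupassant' wins over a shorter
    # entry such as 'Guy' if both are ever in the glossary.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters like '.' or '(' in a term from being
    # read as regex syntax.
    alternation = '|'.join(re.escape(term) for term in ordered)
    # The capturing group makes split() keep the matched terms;
    # \b assumes every term starts and ends with a word character.
    return re.compile(r'\b(' + alternation + r')\b')


pattern = build_pattern(glossary)
parts = pattern.split("Paris est la capitale de France et Vincennes n'est pas loin.")
entities = parts[1::2]  # odd positions hold the matched terms
```

Inside bolder, the re.split(...) call could then be replaced by pattern.split(s[0]), leaving the rest of the function unchanged.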