Insert tag in XML text for mixed words

Time:09-16

I want to annotate text in an XML document using an index of named entities:

glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']

I found a Stack Overflow answer (by @Brad Clements) that partially annotates the XML: insert tags in ElementTree text

The input XML is: <TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>

It works for single-word entries like "Paris", "Vincennes" or "France", but not for the multi-word entries in the index.

My current output is : <?xml version="1.0"?> <TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p>Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>

Expected output: <?xml version="1.0"?> <TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p><entity>Guy de Maupassant</entity> et <entity>Maurice Ravel</entity> inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>

The code I tried to adapt (taken from insert tags in ElementTree text):

from lxml import etree
import string
import itertools

stylesheet = etree.XML("""
    <xsl:stylesheet version="1.0"
         xmlns:btest="uri:bolder"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <xsl:template match="@*">
            <xsl:copy />
        </xsl:template>

        <xsl:template match="*">
            <xsl:element name="{name(.)}">
                <xsl:copy-of select="@*" />
                <xsl:apply-templates select="text()" />
                <xsl:apply-templates select="./*" />
            </xsl:element>
        </xsl:template>

        <xsl:template match="text()">
            <xsl:copy-of select="btest:bolder(.)/node()" />
        </xsl:template>         
     </xsl:stylesheet>
""")


glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']


def bolder(context, s):
    results = []
    r = None
    # - Iteration over the index and the words contained in the sequence
    for gw, word in itertools.zip_longest(glossary, s[0].split()):
        # - Little preprocessing for words with punctuation
        word_clean = word.translate(str.maketrans('', '', string.punctuation))
        # 1) If word in sequence directly match with glossary (index) add tag
        if word_clean in glossary:
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'entity')
            b.text = word
            b.tail = ' '
            results.append(r)
            r = None

        # 2) Otherwise take the term from the index and check that it is contained in the sequence
        # (only if the term is compound, e.g. first name + last name,
        # and is not None)
        elif gw is not None and gw in s[0] and len(gw.split()) > 1:
            # repeat the process to annotate
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'entity')
            b.text = gw
            b.tail = ' '
            results.append(r)
            r = None

        # 3) if none of the prerequisites, add text to output with no tag
        else:
            if r is None:
                r = etree.Element('r')
            r.text = '%s%s ' % (r.text or '', word)

        if r is not None:
            results.append(r)

    return results

def test():
    ns = etree.FunctionNamespace('uri:bolder') 
    ns['bolder'] = bolder 
    transform = etree.XSLT(stylesheet)
    new = str(transform(etree.XML("""<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>""")))
    print(new)
    
if __name__ == "__main__":
    test()

But the output is still wrong (some text is repeated and some is missing):

<?xml version="1.0"?>
<TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p>Guy de Maupassant <entity>Guy de Maupassant</entity> <entity>Maurice Ravel</entity> Ravel inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>

How can I improve the solution above so that multi-word entries are matched as a single entity? Thanks in advance.

CodePudding user response:

Here is an example of how you could do it all within your XSLT stylesheet (assuming the libxslt processor):

XSLT 1.0 + EXSLT str:tokenize()

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:str="http://exslt.org/strings"
extension-element-prefixes="str">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>

<xsl:param name="glossary">Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel</xsl:param>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="p/text()">
    <xsl:call-template name="tag-terms">
        <xsl:with-param name="string" select="."/>
        <xsl:with-param name="terms" select="str:tokenize($glossary, '|')"/>
    </xsl:call-template>
</xsl:template>

<xsl:template name="tag-terms">
    <xsl:param name="string"/>
    <xsl:param name="terms"/>
    <xsl:choose>
        <xsl:when test="$terms">
            <xsl:variable name="term" select="$terms[1]" />
            <xsl:choose>
                <xsl:when test="contains($string, $term)">
                    <!-- process substring-before with the remaining terms -->
                    <xsl:call-template name="tag-terms">
                        <xsl:with-param name="string" select="substring-before($string, $term)"/>
                        <xsl:with-param name="terms" select="$terms[position() > 1]"/>
                    </xsl:call-template>
                    <!-- matched term -->
                    <entity>
                        <xsl:value-of select="$term"/>
                    </entity>
                    <!-- continue with substring-after -->
                    <xsl:call-template name="tag-terms">
                        <xsl:with-param name="string" select="substring-after($string, $term)"/>
                        <xsl:with-param name="terms" select="$terms"/>
                    </xsl:call-template>
                </xsl:when>
                <xsl:otherwise>
                    <!-- pass the entire string for processing with the remaining terms -->
                    <xsl:call-template name="tag-terms">
                        <xsl:with-param name="string" select="$string"/>
                        <xsl:with-param name="terms" select="$terms[position() > 1]"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$string"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

</xsl:stylesheet>

Applied to your input example, this will return:

Result

<?xml version="1.0" encoding="utf-8"?>
<TEI>
  <teiHeader/>
  <body>
    <p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin.</p>
    <p> <entity>Guy de Maupassant</entity> et <entity>Maurice Ravel</entity> inaugurent une nouvelle statue à <entity>Paris</entity>.</p>
  </body>
</TEI>

The glossary can be passed to the stylesheet at runtime as a delimited string.
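
As a rough illustration of passing the glossary at runtime from Python, the sketch below assumes lxml (which wraps libxslt) and uses a deliberately tiny stand-in stylesheet that only tokenizes the parameter. The stylesheet above should accept the parameter the same way. `etree.XSLT.strparam` quotes the value so it arrives as a string rather than being evaluated as an XPath expression.

```python
from lxml import etree

glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']

# Minimal stand-in stylesheet (an assumption for illustration): it declares the
# same "glossary" parameter and lists one <term> per pipe-separated token.
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">
  <xsl:param name="glossary"/>
  <xsl:template match="/">
    <terms>
      <xsl:for-each select="str:tokenize($glossary, '|')">
        <term><xsl:value-of select="."/></term>
      </xsl:for-each>
    </terms>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)

# Join the Python list into the delimited string the stylesheet expects and
# pass it as a keyword argument named after the xsl:param.
result = transform(
    etree.XML("<TEI/>"),
    glossary=etree.XSLT.strparam("|".join(glossary)),
)

terms = [t.text for t in result.getroot().iter("term")]
print(terms)
```

This assumes no glossary entry contains the `|` delimiter. A different separator would be needed otherwise.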

CodePudding user response:

In XSLT 3 you could use analyze-string:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">
    
  <xsl:param name="entity-strings" as="xs:string*" select="'Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel'"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="text()">
      <xsl:apply-templates select="analyze-string(., string-join($entity-strings, '|'))" mode="entity"/>
  </xsl:template>
  
  <xsl:template match="fn:match" mode="entity">
      <entity>{.}</entity>
  </xsl:template>
  
</xsl:stylesheet>

Saxon-C (any edition) from Saxonica has a Python API so it can be used with Python 3.

Or use a Python extension function with lxml:

from lxml import etree as ET
import re


stylesheet = ET.XML("""
    <xsl:stylesheet version="1.0"
         xmlns:btest="uri:bolder"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <xsl:template match="@* | node()">
          <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
        </xsl:template>

        <xsl:template match="text()">
            <xsl:copy-of select="btest:bolder(.)/node()" />
        </xsl:template>         
     </xsl:stylesheet>
""")

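def bolder(context, s):
    results = []
    # re.split with a capturing group keeps the matched terms in the result:
    # even indexes hold plain text, odd indexes hold a glossary term
    splits = re.split(r'(Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel)', s[0])
    for p, split in enumerate(splits):
      # wrapper element; the stylesheet copies only its child nodes
      el = ET.Element("element")
      if p % 2 == 0:
        el.text = split
      else:
        entity = ET.SubElement(el, "entity")
        entity.text = split
      results.append(el)
    return results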
      
ns = ET.FunctionNamespace('uri:bolder') 
ns['bolder'] = bolder 
transform = ET.XSLT(stylesheet)
new = str(transform(ET.XML("""<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>""")))
print(new)
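
The alternation in the pattern above is hard-coded. A minimal sketch of deriving it from the glossary list instead, using only the standard library, could look like the following. The helper names `build_pattern` and `split_entities` are made up for illustration. Entries are escaped with `re.escape` and sorted longest first, which only matters if one entry is a prefix of another (not the case in the current glossary).

```python
import re

glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']


def build_pattern(terms):
    """Compile one capturing alternation from a list of literal terms."""
    # Longest first, so a longer entry wins over a shorter entry it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '(' in a term literal.
    return re.compile('(' + '|'.join(re.escape(t) for t in ordered) + ')')


def split_entities(text, pattern):
    """Return (is_entity, chunk) pairs in document order, skipping empty chunks."""
    # With a capturing group, re.split alternates plain text (even indexes)
    # and matched terms (odd indexes).
    parts = pattern.split(text)
    return [(i % 2 == 1, part) for i, part in enumerate(parts) if part]


pattern = build_pattern(glossary)
text = " Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris."
for is_entity, chunk in split_entities(text, pattern):
    print('ENTITY' if is_entity else 'text  ', repr(chunk))
```

Inside `bolder`, the same compiled pattern could replace the literal one passed to `re.split`.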