I want to annotate text in an XML document using an index (glossary) of named entities:
glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']
I found a solution on Stack Overflow (by @Brad Clements, in the question "insert tags in ElementTree text") that partially annotates the XML. I applied it to this document:
<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>
It works for single-word entries such as "Paris", "Vincennes" or "France", but not for the multi-word entries in the index.
My current output is : <?xml version="1.0"?> <TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p>Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>
Expected output: <?xml version="1.0"?> <TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p><entity>Guy de Maupassant</entity> et <entity>Maurice Ravel</entity> inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>
The code I tried to adapt (taken from insert tags in ElementTree text):
from lxml import etree
import string
import itertools
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:btest="uri:bolder"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*">
    <xsl:copy />
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{name(.)}">
      <xsl:copy-of select="@*" />
      <xsl:apply-templates select="text()" />
      <xsl:apply-templates select="./*" />
    </xsl:element>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:copy-of select="btest:bolder(.)/node()" />
  </xsl:template>
</xsl:stylesheet>
""")
glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']
def bolder(context, s):
    results = []
    r = None
    # - Iteration over the index and the words contained in the sequence
    for gw, word in itertools.zip_longest(glossary, s[0].split()):
        # - Little preprocessing for words with punctuation
        word_clean = word.translate(str.maketrans('', '', string.punctuation))
        # 1) If word in sequence directly match with glossary (index) add tag
        if word_clean in glossary:
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'entity')
            b.text = word
            b.tail = ' '
            results.append(r)
            r = None
        # 2) Otherwise take the word from the index and check that it is contained in the sequence
        # (if the word is composed eg. First name last name
        # and that it is not None)
        elif gw is not None and gw in s[0] and len(gw.split()) > 1:
            # repeat the process to annotate
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'entity')
            b.text = gw
            b.tail = ' '
            results.append(r)
            r = None
        # 3) if none of the prerequisites, add text to output with no tag
        else:
            if r is None:
                r = etree.Element('r')
            r.text = '%s%s ' % (r.text or '', word)
    if r is not None:
        results.append(r)
    return results
def test():
    ns = etree.FunctionNamespace('uri:bolder')
    ns['bolder'] = bolder
    transform = etree.XSLT(stylesheet)
    new = str(transform(etree.XML("""<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>""")))
    print(new)

if __name__ == "__main__":
    test()
But the output is still wrong (it contains both repetitions and omissions):
<?xml version="1.0"?>
<TEI><teiHeader/><body><p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin. </p><p>Guy de Maupassant <entity>Guy de Maupassant</entity> <entity>Maurice Ravel</entity> Ravel inaugurent une nouvelle statue à <entity>Paris.</entity> </p></body></TEI>
How can I improve the solution above so that multi-word entries are matched as a single entity? Thanks in advance.
CodePudding user response:
Here is an example of how you could do it entirely within your XSLT stylesheet (assuming the libxslt processor):
XSLT 1.0 with EXSLT str:tokenize()
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">
  <xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
  <xsl:param name="glossary">Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel</xsl:param>
  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="p/text()">
    <xsl:call-template name="tag-terms">
      <xsl:with-param name="string" select="."/>
      <xsl:with-param name="terms" select="str:tokenize($glossary, '|')"/>
    </xsl:call-template>
  </xsl:template>
  <xsl:template name="tag-terms">
    <xsl:param name="string"/>
    <xsl:param name="terms"/>
    <xsl:choose>
      <xsl:when test="$terms">
        <xsl:variable name="term" select="$terms[1]" />
        <xsl:choose>
          <xsl:when test="contains($string, $term)">
            <!-- process substring-before with the remaining terms -->
            <xsl:call-template name="tag-terms">
              <xsl:with-param name="string" select="substring-before($string, $term)"/>
              <xsl:with-param name="terms" select="$terms[position() > 1]"/>
            </xsl:call-template>
            <!-- matched term -->
            <entity>
              <xsl:value-of select="$term"/>
            </entity>
            <!-- continue with substring-after -->
            <xsl:call-template name="tag-terms">
              <xsl:with-param name="string" select="substring-after($string, $term)"/>
              <xsl:with-param name="terms" select="$terms"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <!-- pass the entire string for processing with the remaining terms -->
            <xsl:call-template name="tag-terms">
              <xsl:with-param name="string" select="$string"/>
              <xsl:with-param name="terms" select="$terms[position() > 1]"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$string"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
Applied to your input example, this will return:
Result
<?xml version="1.0" encoding="utf-8"?>
<TEI>
  <teiHeader/>
  <body>
    <p><entity>Paris</entity> est la capitale de <entity>France</entity> et <entity>Vincennes</entity> n'est pas loin.</p>
    <p> <entity>Guy de Maupassant</entity> et <entity>Maurice Ravel</entity> inaugurent une nouvelle statue à <entity>Paris</entity>.</p>
  </body>
</TEI>
The glossary can be passed to the stylesheet at runtime as a delimited string.
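As a rough sketch of passing the glossary at runtime from Python, assuming lxml (which wraps libxslt) is the host: the stylesheet below is a compacted copy of the one above with an empty default for the glossary parameter, and annotate() is a hypothetical helper name, not part of lxml. The value is wrapped in etree.XSLT.strparam() so that it is handed over as a plain string rather than evaluated as an XPath expression.

```python
from lxml import etree

# Compact copy of the stylesheet above; the glossary parameter has no default here.
STYLESHEET = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">
  <xsl:param name="glossary"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="p/text()">
    <xsl:call-template name="tag-terms">
      <xsl:with-param name="string" select="."/>
      <xsl:with-param name="terms" select="str:tokenize($glossary, '|')"/>
    </xsl:call-template>
  </xsl:template>
  <xsl:template name="tag-terms">
    <xsl:param name="string"/>
    <xsl:param name="terms"/>
    <xsl:choose>
      <xsl:when test="$terms">
        <xsl:variable name="term" select="$terms[1]"/>
        <xsl:choose>
          <xsl:when test="contains($string, $term)">
            <xsl:call-template name="tag-terms">
              <xsl:with-param name="string" select="substring-before($string, $term)"/>
              <xsl:with-param name="terms" select="$terms[position() > 1]"/>
            </xsl:call-template>
            <entity><xsl:value-of select="$term"/></entity>
            <xsl:call-template name="tag-terms">
              <xsl:with-param name="string" select="substring-after($string, $term)"/>
              <xsl:with-param name="terms" select="$terms"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="tag-terms">
              <xsl:with-param name="string" select="$string"/>
              <xsl:with-param name="terms" select="$terms[position() > 1]"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise><xsl:value-of select="$string"/></xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
""")


def annotate(xml_text, terms):
    """Hypothetical helper: run the stylesheet with `terms` as the glossary."""
    transform = etree.XSLT(STYLESHEET)
    # Join the terms with the same '|' delimiter the stylesheet tokenizes on.
    glossary_param = etree.XSLT.strparam("|".join(terms))
    return str(transform(etree.XML(xml_text), glossary=glossary_param))


doc = ("<TEI><teiHeader/><body><p> Guy de Maupassant et Maurice Ravel "
       "inaugurent une nouvelle statue à Paris.</p></body></TEI>")
annotated = annotate(doc, ['Paris', 'Vincennes', 'France',
                           'Guy de Maupassant', 'Maurice Ravel'])
print(annotated)
```

Note that the matching is plain substring matching, so a term that itself contains the '|' delimiter would need a different separator.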
CodePudding user response:
In XSLT 3 you could use analyze-string:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">
  <xsl:param name="entity-strings" as="xs:string*" select="'Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel'"/>
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:template match="text()">
    <xsl:apply-templates select="analyze-string(., string-join($entity-strings, '|'))" mode="entity"/>
  </xsl:template>
  <xsl:template match="fn:match" mode="entity">
    <entity>{.}</entity>
  </xsl:template>
</xsl:stylesheet>
Saxon-C (any edition) from Saxonica has a Python API so it can be used with Python 3.
Alternatively, register a Python extension function with lxml:
from lxml import etree as ET
import re
stylesheet = ET.XML("""
<xsl:stylesheet version="1.0"
    xmlns:btest="uri:bolder"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:copy-of select="btest:bolder(.)/node()" />
  </xsl:template>
</xsl:stylesheet>
""")
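def bolder(context, s):
    results = []
    # The capturing group makes re.split keep the matched terms:
    # even indexes hold plain text, odd indexes hold glossary matches.
    splits = re.split(r'(Paris|Vincennes|France|Guy de Maupassant|Maurice Ravel)', s[0])
    for p, split in enumerate(splits):
        if p % 2 == 0:
            # plain text: a wrapper element whose only child is a text node
            el = ET.Element("element")
            el.text = split
        else:
            # glossary match: a wrapper element holding an <entity> child
            el = ET.Element("element")
            entity = ET.SubElement(el, "entity")
            entity.text = split
        results.append(el)
    return results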
ns = ET.FunctionNamespace('uri:bolder')
ns['bolder'] = bolder
transform = ET.XSLT(stylesheet)
new = str(transform(ET.XML("""<TEI><teiHeader></teiHeader><body><p>Paris est la capitale de France et Vincennes n'est pas loin.</p><p> Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris.</p></body></TEI>""")))
print(new)
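The pattern in bolder() above hard-codes the glossary. A minimal sketch of building it from the list instead, assuming the entries should be matched literally: build_pattern is a hypothetical helper name, re.escape guards against entries containing regex metacharacters, and sorting longest-first is a precaution so that a longer entry is tried before a shorter one that is its prefix.

```python
import re

glossary = ['Paris', 'Vincennes', 'France', 'Guy de Maupassant', 'Maurice Ravel']


def build_pattern(terms):
    """Hypothetical helper: one capturing alternation over the escaped terms."""
    # Longest entries first, so they are tried before any shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile('(' + '|'.join(re.escape(t) for t in ordered) + ')')


pattern = build_pattern(glossary)
text = " Guy de Maupassant et Maurice Ravel inaugurent une nouvelle statue à Paris."
# As in bolder(): even indexes are plain text, odd indexes are matched entities.
parts = pattern.split(text)
print(parts)
```

Inside bolder(), the call to re.split(...) could then be replaced by pattern.split(s[0]).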