I have to process XML documents with structures like the following:
<tok xpos="ABCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="ABTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
I need to change part of the value of the 'xpos' attribute, but only if the value of the same attribute in the following element starts with a specific sequence of characters. In this example, for any 'xpos' value starting with 'AB', I need to replace 'AB' with 'XX' and preserve the rest of the string, provided that the 'xpos' value of the following element starts with 'XY'.
So, after processing, the output would have to be:
<tok xpos="XXCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="XXTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
I have tried to do this with the following code. You will notice that I have attempted to use backreferencing by using parentheses for the two substrings in which I divide the values of the affected attributes and then referencing the second capturing group with \2.
source = """
<root>
<tok xpos="ABCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="ABTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
</root>
"""
import lxml.etree
root_element = lxml.etree.XML(source)
for el in root_element.xpath('//tok[starts-with(@xpos, "XY")]/preceding-sibling::tok[1][re:match(@xpos, "^(AB)([A-Z]+)")]',
                             namespaces={"re": "http://exslt.org/regular-expressions"}):
    el.set('xpos', 'XX\2')
This does not work. I get the following error message:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
How do I go about achieving this goal? In principle, according to this https://www.regular-expressions.info/xpath.html XPath supports backreferences and capturing groups. I just don't know how I should go about implementing it. What am I doing wrong?
JM
CodePudding user response:
I would use XSLT 1.0 with EXSLT support in Python:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:re="http://exslt.org/regular-expressions"
exclude-result-prefixes="re"
version="1.0">
<xsl:param name="pattern">^(AB)([A-Z]+)</xsl:param>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="tok[re:match(@xpos, '^(AB)([A-Z]+)')][following-sibling::tok[1][starts-with(@xpos, 'XY')]]/@xpos">
<xsl:attribute name="{name()}">
<xsl:value-of select="re:replace(., $pattern, '', 'XX\2')"/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
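To run a stylesheet like this from Python, lxml's XSLT class can be used. The following is only a sketch: the names STYLESHEET, transform, and values are illustrative, and the replacement string is written as 'XX\2' on the assumption that lxml evaluates re:replace with Python's re module, where \2 (rather than $2) refers to the second group.

```python
import lxml.etree as ET

source = """
<root>
<tok xpos="ABCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="ABTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
</root>
"""

# Raw string so that Python leaves the backslash in 'XX\2' untouched.
STYLESHEET = r"""<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:re="http://exslt.org/regular-expressions"
    exclude-result-prefixes="re"
    version="1.0">
  <xsl:param name="pattern">^(AB)([A-Z]+)</xsl:param>
  <!-- Identity template: copy everything that is not overridden below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Rewrite xpos only when the next tok sibling's xpos starts with XY. -->
  <xsl:template match="tok[re:match(@xpos, '^(AB)([A-Z]+)')][following-sibling::tok[1][starts-with(@xpos, 'XY')]]/@xpos">
    <xsl:attribute name="{name()}">
      <xsl:value-of select="re:replace(., $pattern, '', 'XX\2')"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>"""

transform = ET.XSLT(ET.XML(STYLESHEET))
result = transform(ET.XML(source))

# Collect the attribute values from the transformed tree for inspection.
values = [tok.get('xpos') for tok in result.getroot().iter('tok')]
print(values)
```

The transform returns a new tree and leaves the input document unmodified.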
CodePudding user response:
Note that Python's lxml supports regexes through Python's own implementation, which is different from the XPath flavor of regexes (which is what Regular-Expressions.info describes).
Though backslashes are used for backreferences in some regex engines, Python first interprets '\2' in a string literal as an escape sequence, specifically for character code 2, the start-of-text control character. That control character is what triggers the ValueError. Either the backslash should be escaped ('\\2') or the string should be made raw (r'\2').
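The escape-sequence behaviour is easy to check in an interpreter. A small sketch using only string literals (the variable names are illustrative):

```python
# '\2' is an octal escape: Python turns it into the single character U+0002.
escaped = '\2'

# Escaping the backslash or using a raw string keeps both characters.
doubled = '\\2'
raw = r'\2'

print(len(escaped), repr(escaped))  # a single control character
print(len(raw), raw == doubled)     # a backslash followed by '2'
```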
However, neither will address the ultimate issue. Firstly, backreferences aren't stored across calls, nor even between regexes; they are only valid within their own regex. Secondly, lxml.etree._Element.set isn't regex aware; it stores its argument as a literal string.
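For comparison, a backreference does work when the pattern and the replacement belong to the same substitution. A sketch with Python's re.sub, applied to plain strings rather than through XPath (the sample values are taken from the question; the variable names are illustrative):

```python
import re

# Group 1 is the 'AB' prefix, group 2 is the rest of the value.
pattern = re.compile(r'^(AB)([A-Z]+)')

# \2 is resolved by re.sub itself, within this one substitution.
new_value = pattern.sub(r'XX\2', 'ABCDE')
print(new_value)

# A value that does not match the pattern is returned unchanged.
unchanged = pattern.sub(r'XX\2', 'XYZ')
print(unchanged)
```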
In this case, regexes aren't needed; the requirements are simple enough to implement without them. You can use lxml.etree._Element.get to get the attribute value, then create the new value in Python.
for el in root_element.xpath('//tok[starts-with(@xpos, "XY")]/preceding-sibling::tok[1][starts-with(@xpos, "AB")]'):
    el.set('xpos', 'XX' + el.get('xpos')[2:])
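Put together with the sample document from the question, the whole thing might look like the sketch below. The names xpath and values are only there for illustration, and the final list is just a convenient way to inspect the result.

```python
import lxml.etree

source = """
<root>
<tok xpos="ABCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="ABTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
</root>
"""

root_element = lxml.etree.XML(source)

# Select each tok that starts with "AB" and is immediately followed
# by a tok sibling whose xpos starts with "XY".
xpath = ('//tok[starts-with(@xpos, "XY")]'
         '/preceding-sibling::tok[1][starts-with(@xpos, "AB")]')

for el in root_element.xpath(xpath):
    # Keep everything after the two-character prefix.
    el.set('xpos', 'XX' + el.get('xpos')[2:])

values = [tok.get('xpos') for tok in root_element.iter('tok')]
print(values)
```

Unlike the XSLT approach, this modifies the parsed tree in place.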