I have to process a bunch of XML documents that have structures similar to the following:
<tok xpos="ABCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="ABTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
What I need to do is change part of the value of the 'xpos' attribute, but only if the value of the same attribute in the following element starts with a specific sequence of characters. In this example, for any 'xpos' value that starts with 'AB', I need to replace 'AB' with 'XX' and keep the rest of the string, provided that the 'xpos' value of the following element starts with 'XY'.
So, after processing, the output would have to be:
<tok xpos="XXCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
<tok xpos="XXTSV">node8</tok> <tok xpos="XYU">node9</tok>
<tok xpos="ABTSV">node14</tok> <tok xpos="XBZ">node15</tok>
Following the suggestions of a helpful Stack Overflow contributor, I tried to use XSLT with the following code:
# coding: utf-8
import os
import lxml.etree as et
import time
ROOT = '/Users/somepath'
ext = ('.xml')
def xml_change(root_element):
    et.XSLT(et.XML('''
    <xsl:stylesheet
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:re="http://exslt.org/regular-expressions"
        exclude-result-prefixes="re"
        version="1.0">
      <xsl:param name="pattern">^(AB)([A-Z]+)</xsl:param>
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
      </xsl:template>
      <xsl:template match="tok[re:match(@xpos, '^(AB)([A-Z]+)')][following-sibling::tok[1][starts-with(@xpos, 'XY')]]/@xpos">
        <xsl:attribute name="{name()}">
          <xsl:value-of select="re:replace(., $pattern, '', 'XX$2')"/>
        </xsl:attribute>
      </xsl:template>
    </xsl:stylesheet>
    '''))
for root, dirs, files in os.walk(ROOT):
    for file in files:
        if file.endswith(ext):
            file_path = os.path.join(ROOT, file)
            # load root element from file
            file_root = et.parse(file_path).getroot()
            # init tree object from file_root
            xml_change(file_root)
            tree = et.ElementTree(file_root)
            # save cleaned xml tree object to file. Important to specify encoding
            tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', pretty_print=True, doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)
This doesn't seem to work but I don't know exactly why. The documents appear to be processed and I get the copies of the files with the added '-clean' in their names. I don't get any error messages but nothing seems to have changed inside the documents.
EDIT: After reading the answers, I changed the relevant parts of my code to:
xslt_job = et.XML('''
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="tok[starts-with(@xpos, 'AB')][starts-with(following-sibling::tok[1]/@xpos, 'XY')]/@xpos">
    <xsl:attribute name="xpos">
      <xsl:text>XX</xsl:text>
      <xsl:value-of select="substring(., 3)"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
''')
transform = et.XSLT(xslt_job)
# iterate all dirs
for root, dirs, files in os.walk(ROOT):
    for file in files:
        if file.endswith(ext):
            file_path = os.path.join(ROOT, file)
            file_root = et.parse(file_path).getroot()
            result_tree = transform(file_root)
            result_tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)
I'm happy because this finally seems to be working (i.e. the newly created files contain the desired transformations). However, there is one problem: the line structure of the original documents is lost, and each output XML document ends up on a single line. I know I could add the parameter "pretty_print=True" to the write command, but that won't work for me: it produces well-formed, indented XML, whereas the output needs to preserve exactly the layout of the originals. Is there any way to achieve this?
CodePudding user response:
Move the parsing of the stylesheet out of the function so that it is compiled once, before the function is defined, e.g.
transform = et.XSLT(et.XML('''
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:re="http://exslt.org/regular-expressions"
    exclude-result-prefixes="re"
    version="1.0">
  <xsl:param name="pattern">^(AB)([A-Z]+)</xsl:param>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="tok[re:match(@xpos, '^(AB)([A-Z]+)')][following-sibling::tok[1][starts-with(@xpos, 'XY')]]/@xpos">
    <xsl:attribute name="{name()}">
      <xsl:value-of select="re:replace(., $pattern, '', 'XX$2')"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
'''))
Then use e.g.
def xml_change(root_element):
    return transform(root_element)
and e.g.
result_tree = xml_change(file_root)
Then, to save the result, use the documented write_output method, e.g.
result_tree.write_output(file_path.replace('.xml', '-clean.xml'))
Make sure your XSLT sets any wanted encoding and DOCTYPE/system id with e.g. <xsl:output method="xml" encoding="UTF-8" doctype-system="estcorpus.dtd"/>
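Putting those pieces together, here is a minimal sketch rather than a definitive version. The function name transform_directory and its ext/suffix parameters are made up for illustration, and the stylesheet is reduced to a bare identity copy to keep the example short; substitute your real templates. It joins each file name with the directory that os.walk is currently visiting, which matters as soon as files live in subfolders.

```python
import os
import lxml.etree as et

# Compiled once, at import time. Hypothetical minimal stylesheet:
# identity copy only, with the DOCTYPE declared in xsl:output.
transform = et.XSLT(et.XML(b'''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" doctype-system="estcorpus.dtd"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>'''))


def transform_directory(top, ext='.xml', suffix='-clean.xml'):
    """Transform every *.xml file under `top`; return the paths written."""
    written = []
    for root, dirs, files in os.walk(top):
        for name in files:
            # Skip non-XML files and files produced by an earlier run.
            if not name.endswith(ext) or name.endswith(suffix):
                continue
            src = os.path.join(root, name)   # the walked directory, not `top`
            dst = src[:-len(ext)] + suffix
            result_tree = transform(et.parse(src))
            # write_output serialises according to <xsl:output>
            result_tree.write_output(dst)
            written.append(dst)
    return written
```

Calling transform_directory('/Users/somepath') would then write a '-clean.xml' copy next to each input file.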
CodePudding user response:
The libxslt processor does not itself implement the EXSLT Regular Expressions extension functions. lxml supplies its own implementation of them (enabled by default), so your stylesheet may run there without an error, but it will not be portable to other libxslt-based tools.
The stated task can actually be accomplished quite easily without any extension functions. Given a well-formed input such as:
XML
<root>
  <tok xpos="ABCDE">node1</tok>
  <tok xpos="XYZ">node2</tok>
  <tok xpos="ABCDE">node6</tok>
  <tok xpos="RST">node7</tok>
  <tok xpos="ABTSV">node8</tok>
  <tok xpos="XYU">node9</tok>
  <tok xpos="ABTSV">node14</tok>
  <tok xpos="XBZ">node15</tok>
</root>
the following stylesheet:
XSLT 1.0
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="tok[starts-with(@xpos, 'AB')][starts-with(following-sibling::tok[1]/@xpos, 'XY')]/@xpos">
    <xsl:attribute name="xpos">
      <xsl:text>XX</xsl:text>
      <xsl:value-of select="substring(., 3)"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
will return:
Result
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tok xpos="XXCDE">node1</tok>
  <tok xpos="XYZ">node2</tok>
  <tok xpos="ABCDE">node6</tok>
  <tok xpos="RST">node7</tok>
  <tok xpos="XXTSV">node8</tok>
  <tok xpos="XYU">node9</tok>
  <tok xpos="ABTSV">node14</tok>
  <tok xpos="XBZ">node15</tok>
</root>
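If the original line layout has to survive, one option worth trying is to leave out both xsl:strip-space and indent="yes". This is a suggestion on my part, not something tested on your real files. The identity template then copies the whitespace text nodes between elements unchanged. The sketch below assumes lxml's default parser settings (which keep whitespace-only text) and uses a made-up two-line sample, wrapped in a root element, in place of a real document.

```python
import lxml.etree as et

# Same templates as above, but without xsl:strip-space and without indent="yes".
STYLESHEET = b'''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <!-- identity transform: also copies whitespace text nodes -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="tok[starts-with(@xpos, 'AB')][starts-with(following-sibling::tok[1]/@xpos, 'XY')]/@xpos">
    <xsl:attribute name="xpos">
      <xsl:text>XX</xsl:text>
      <xsl:value-of select="substring(., 3)"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>'''

transform = et.XSLT(et.XML(STYLESHEET))

# Hypothetical sample: two tok elements per line, as in the question.
SAMPLE = b'''<root>
<tok xpos="ABCDE">node1</tok> <tok xpos="XYZ">node2</tok>
<tok xpos="ABCDE">node6</tok> <tok xpos="RST">node7</tok>
</root>'''

result = transform(et.XML(SAMPLE))
output = et.tostring(result, encoding='unicode')
print(output)
```

The line breaks and the single spaces between the tok elements should come through as they were in the input, with only the matching xpos values rewritten.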