I have to process XML files with structures such as the following:
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>
So far I have been using the following code to change attribute values in certain elements whenever a particular condition is met. For instance, to change the 'tag1' and 'tag2' attributes only in the 'tok' elements whose 'tag1' attribute is 'blah1', this does the job:
def xml_change(root_element):
    # Update tag1 and tag2 on every <tok> whose tag1 is "blah1"
    for el in root_element.xpath('//tok'):
        if el.get('tag1') == "blah1":
            el.set('tag1', 'Blah1-TEXT1')
            el.set('tag2', 'Blah2-TEXT1')
It produces:
<tok tag1="Blah1-TEXT1" tag2="Blah2-TEXT1" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>
What I need to do next, though, is a bit more complicated and I'm totally stumped. Let me try to describe the problem to see if you can point me to a satisfactory solution.
In some cases I need to change an attribute of a 'tok' element only if the elements preceding or following it have certain attribute values. So, say I have the following XML:
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
What would I have to change in my code to set the 'tag1' attribute to, say, "newattrib", only where the 'tag1' attribute of the previous element is "blah1" and the 'tag2' attribute of the following element is "blahY"? Using the XML above, this should affect only the element with text 'TEXT5' and produce:
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok>
<tok tag1="newattrib" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
In essence, what I don't know is how to specify the context for the elements that I want to modify.
CodePudding user response:
You'll have to use a somewhat complicated XPath expression involving the preceding-sibling and following-sibling axes, but it's doable. Try something like this:
from lxml import etree
blahs = """<root>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
</root>"""
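doc = etree.fromstring(blahs)
# Select each <tok> whose nearest preceding <tok> has tag1="blah1"
# and whose nearest following <tok> has tag2="blahY"
for el in doc.xpath('//tok[preceding-sibling::tok[1][@tag1="blah1"]][following-sibling::tok[1][@tag2="blahY"]]'):
    el.set('tag1', 'newattrib')
print(etree.tostring(doc).decode())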
The output should be your expected output.
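Depending on the actual structure, you may be able to drop the [1]s in the expression. Note that without them, each predicate matches any preceding or following 'tok' sibling rather than only the adjacent one.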
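If you would rather express the context check in Python than in XPath, the same idea can be sketched as a loop over neighbouring siblings. This is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (the loop itself should work the same way on lxml elements), the function name retag_by_neighbours is made up for this example, and "previous" and "following" are taken to mean the nearest 'tok' siblings under the same parent.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<root>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
</root>"""


def retag_by_neighbours(root, new_value="newattrib"):
    """Hypothetical helper: set tag1 on each <tok> whose adjacent <tok>
    siblings match the context, and return the modified elements."""
    changed = []
    for parent in root.iter():
        # Only <tok> children of the same parent count as neighbours
        toks = [child for child in parent if child.tag == "tok"]
        # Collect the matches first, then modify, so that an edit never
        # changes the context seen by a later check
        hits = [
            cur
            for prev, cur, nxt in zip(toks, toks[1:], toks[2:])
            if prev.get("tag1") == "blah1" and nxt.get("tag2") == "blahY"
        ]
        for el in hits:
            el.set("tag1", new_value)
            changed.append(el)
    return changed


root = ET.fromstring(SAMPLE)
changed = retag_by_neighbours(root)
print([el.text for el in changed])
print(ET.tostring(root).decode())
```

The first and last 'tok' under a parent are never candidates here, because they lack a previous or a following sibling. A loop like this is more verbose than the XPath version, but it may be easier to extend when the conditions get more involved.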