I'm working with an XML corpus that looks like this:
<corpus>
<dialogue speaker="A">
<sentence tag1="attribute1" tag2="attribute2"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="different_attribute1" tag2="different_attribute2"> How are you </sentence>
</dialogue>
</corpus>
I use root.findall() to search for all instances of "different_attribute2", but I would then like to print not only the parent element of the matching sentence but also the element that comes before it:
{'speaker': 'A'}
Hello
{'speaker': 'B'}
How are you
I'm quite new to coding, so I've tried a bunch of for loops and if statements without success. I start with:
for words in root.findall('.//sentence[@tag2="different_attribute2"]'):
    for speaker in root.findall('.//sentence[@tag2="different_attribute2"]...'):
        print(speaker.attrib)
        print(words.text)
But then I have no idea how to retrieve speaker A. Can anyone help me?
CodePudding user response:
Using lxml and a single XPath expression to find all elements:
>>> from lxml import etree
>>> tree = etree.parse('/home/lmc/tmp/test.xml')
>>> for e in tree.xpath('//sentence[@tag2="different_attribute2"]/parent::dialogue/@speaker | //sentence[@tag2="different_attribute2"]/text() | //dialogue[following-sibling::dialogue/sentence[@tag2="different_attribute2"]]/sentence/text() | //dialogue[following-sibling::dialogue/sentence[@tag2="different_attribute2"]]/@speaker'):
...     print(e)
...
A
Hello
B
How are you
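Since the question uses root.findall(), here is a minimal sketch of the same idea with only the standard library's xml.etree.ElementTree. It assumes every dialogue is a direct child of the root element, as in the sample, and that "the element that comes before" means the previous dialogue in that list. The function name matches_with_previous is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample corpus from the question, inlined so the sketch is self-contained.
XML = """<corpus>
<dialogue speaker="A">
<sentence tag1="attribute1" tag2="attribute2"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="different_attribute1" tag2="different_attribute2"> How are you </sentence>
</dialogue>
</corpus>"""


def matches_with_previous(root, value):
    """Collect (attrib, text) pairs for each matching dialogue and the one before it.

    Hypothetical helper: assumes all <dialogue> elements are direct children of root.
    """
    dialogues = root.findall("dialogue")
    rows = []
    for i, dialogue in enumerate(dialogues):
        # Skip dialogues that have no sentence with the wanted tag2 value.
        if dialogue.find(f'sentence[@tag2="{value}"]') is None:
            continue
        # Slice out the previous dialogue (if there is one) plus the match itself.
        for d in dialogues[max(i - 1, 0): i + 1]:
            for sentence in d.findall("sentence"):
                rows.append((d.attrib, (sentence.text or "").strip()))
    return rows


root = ET.fromstring(XML)
for attrib, text in matches_with_previous(root, "different_attribute2"):
    print(attrib)
    print(text)
```

Keeping the dialogues in a list makes "the one before" a plain index lookup, which sidesteps the fact that ElementTree's limited XPath support has no sibling axes.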
XPath details
Find speaker B:
//sentence[@tag2="different_attribute2"]/parent::dialogue/@speaker
Find the sentence of B:
//sentence[@tag2="different_attribute2"]/text()
Find the sentence of A, given B:
//dialogue[following-sibling::dialogue/sentence[@tag2="different_attribute2"]]/sentence/text()
Find speaker A, given B:
//dialogue[following-sibling::dialogue/sentence[@tag2="different_attribute2"]]/@speaker
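One caveat on the following-sibling expressions: they select every dialogue that has a matching dialogue anywhere after it, not only the one directly before. With two dialogues that makes no difference, but in a longer corpus it would pull in all earlier speakers. Below is a hedged sketch that restricts the lookup to the nearest earlier dialogue with preceding-sibling::dialogue[1]. The extra speaker "C" dialogue is an assumption added to the sample to show the difference, and the helper name with_previous is hypothetical.

```python
from lxml import etree

# Sample from the question, plus one hypothetical leading dialogue (speaker "C")
# added only to show that it is NOT picked up.
XML = """<corpus>
<dialogue speaker="C">
<sentence tag1="attribute1" tag2="attribute2"> Hi </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="attribute1" tag2="attribute2"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="different_attribute1" tag2="different_attribute2"> How are you </sentence>
</dialogue>
</corpus>"""


def with_previous(root, value):
    """Return (speaker, text) pairs for each match and its immediate predecessor."""
    rows = []
    # $value is an XPath variable, bound through the keyword argument.
    for dialogue in root.xpath('//dialogue[sentence[@tag2=$value]]', value=value):
        # [1] on a reverse axis means the nearest preceding sibling; the result
        # is an empty list when the match is the first dialogue.
        previous = dialogue.xpath('preceding-sibling::dialogue[1]')
        for d in previous + [dialogue]:
            for text in d.xpath('sentence/text()'):
                rows.append((d.get('speaker'), text.strip()))
    return rows


root = etree.fromstring(XML)
for speaker, text in with_previous(root, "different_attribute2"):
    print(speaker)
    print(text)
```

Looping over the matches in Python, rather than building one union expression, also keeps each match grouped with its own predecessor when there are several matches.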