I have a pretty big XML file that looks like this:
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you </sentence>
<sentence tag1="ff" tag2= "e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2= "bbb"> Great </sentence>
<sentence tag1="f" tag2= "dd"> How about you </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>
I need to remove the <sentence> tags so that the fragmented text is joined back together directly under its parent <dialogue>, giving output like this:
<corpus>
<dialogue speaker="A">
Hello
</dialogue>
<dialogue speaker="B">
How are you today
</dialogue>
<dialogue speaker="A">
Great How about you
</dialogue>
<dialogue speaker="B">
me too
</dialogue>
</corpus>
I've tried element.strip() and element.tag.strip(), but I get no output. This is my code:
import xml.etree.ElementTree as ET

f = ET.parse("file.xml")
root = f.getroot()
for s in root.findall("sentence"):
    text = s.tag.strip("sentence")
    print(text)
What am I doing wrong? Thank you all for your help!!
Answer:
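You're almost there. Two things go wrong in your code: root.findall("sentence") only searches the direct children of <corpus>, which are all <dialogue> elements, so the loop never runs; and s.tag is just the element name ("sentence"), so stripping it never touches the text (use s.text for that). To get your output, try: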
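for d in root.findall(".//dialogue"):
    sentences = d.findall("sentence")
    # Join the text of every sentence; assigning d.text inside the loop
    # would keep only the last one.
    joined = " ".join(s.text.strip() for s in sentences if s.text)
    for s in sentences:
        d.remove(s)
    # Newlines put the text on its own line, as in the desired output.
    d.text = "\n" + joined + "\n"
print(ET.tostring(root).decode())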
And that should output what you need.
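If you want to try this without file.xml, here is a minimal self-contained sketch. It assumes a shortened copy of your sample held in a string; the names SAMPLE and flatten_dialogues are made up for this illustration and are not part of ElementTree. It also shows why your original loop printed nothing.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question (assumption: the real file
# has the same corpus > dialogue > sentence layout).
SAMPLE = """<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd"> How are you </sentence>
<sentence tag1="ff" tag2="e"> today </sentence>
</dialogue>
</corpus>"""


def flatten_dialogues(root):
    """Replace the <sentence> children of each <dialogue> with their joined text."""
    # findall returns a list, so removing children inside the loop is safe.
    for dialogue in root.findall(".//dialogue"):
        sentences = dialogue.findall("sentence")
        joined = " ".join(s.text.strip() for s in sentences if s.text)
        for s in sentences:
            dialogue.remove(s)
        dialogue.text = "\n" + joined + "\n"
    return root


root = ET.fromstring(SAMPLE)

# "sentence" matches direct children of <corpus> only, so this list is empty;
# ".//sentence" searches the whole subtree.
direct_hits = root.findall("sentence")
nested_hits = root.findall(".//sentence")

flatten_dialogues(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

For the real file, ET.parse("file.xml").getroot() can be passed to the same function. Text that follows a closing </sentence> tag (its tail) is dropped here, which is harmless for this sample because the tails are only newlines.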