XML : remove tag but keep text

Time:05-20

I have a pretty big XML file that looks like this:

<corpus>
  <dialogue speaker="A">
    <sentence tag1="a" tag2="b"> Hello </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="cc" tag2= "dd"> How are you </sentence>
    <sentence tag1="ff" tag2= "e"> today </sentence>
  </dialogue>
  <dialogue speaker="A">
    <sentence tag1="d" tag2= "bbb"> Great </sentence>
    <sentence tag1="f" tag2= "dd"> How about you </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="a" tag2= "dd"> me too </sentence>
  </dialogue>
</corpus>

and I need to remove the sentence tags so that the fragmented text is joined back together directly under the parent element, giving output like this:

<corpus>
  <dialogue speaker="A">
    Hello
  </dialogue>
  <dialogue speaker="B">
    How are you today
  </dialogue>
  <dialogue speaker="A">
    Great How about you
  </dialogue>
  <dialogue speaker="B">
     me too
  </dialogue>
</corpus>

I've tried element.strip() and element.tag.strip(), but I get no output. This is my code:

f = ET.parse("file.xml")
root = f.getroot()

for s in root.findall("sentence"):
    text = s.tag.strip("sentence")
    print(text)

What am I doing wrong? Thank you all for your help!!

CodePudding user response:

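You're almost there. Two things go wrong in your code: root.findall("sentence") only searches the direct children of <corpus>, so it matches nothing and the loop never runs; and s.tag.strip("sentence") strips characters from the tag name string, it does not remove the tag from the tree. To get your output, collect the text of the sentences under each dialogue, remove the sentence elements, and set the joined text on the parent:

import xml.etree.ElementTree as ET

root = ET.parse("file.xml").getroot()

for d in root.findall(".//dialogue"):
    sentences = d.findall("sentence")
    # Collect the stripped text of every sentence, skipping empty ones
    parts = [s.text.strip() for s in sentences if s.text and s.text.strip()]
    for s in sentences:
        d.remove(s)
    # Put the joined text directly under the dialogue element
    d.text = " ".join(parts)
print(ET.tostring(root).decode())

That should give you the text you need under each dialogue. Only the whitespace around it differs from your example, because each piece is stripped before joining.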
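
If you want to try this without touching your real file, here is a minimal self-contained sketch that runs on a shortened copy of your sample. The names SAMPLE and merge_sentences are made up for illustration, and it assumes every sentence sits directly under a dialogue, as in your example. It also shows why your original loop printed nothing.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question (hypothetical test data)
SAMPLE = """<corpus>
  <dialogue speaker="A">
    <sentence tag1="a" tag2="b"> Hello </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="cc" tag2="dd"> How are you </sentence>
    <sentence tag1="ff" tag2="e"> today </sentence>
  </dialogue>
</corpus>"""


def merge_sentences(root):
    """Replace each dialogue's <sentence> children with their joined text."""
    # findall returns a list, so removing children inside the loop is safe
    for dialogue in root.findall(".//dialogue"):
        sentences = dialogue.findall("sentence")
        parts = [s.text.strip() for s in sentences if s.text and s.text.strip()]
        for s in sentences:
            dialogue.remove(s)
        dialogue.text = " ".join(parts)
    return root


root = ET.fromstring(SAMPLE)

# On the root, findall("sentence") only checks direct children,
# so it comes back empty: this is why the original loop never ran.
direct_children = root.findall("sentence")
print(direct_children)  # []

merge_sentences(root)
merged = [d.text for d in root.findall(".//dialogue")]
print(merged)  # ['Hello', 'How are you today']
print(ET.tostring(root, encoding="unicode"))
```

For your real data, replace ET.fromstring(SAMPLE) with ET.parse("file.xml").getroot() and write the result back out with ET.ElementTree(root).write(...).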
