I cannot extract the text from an element using ElementTree

Time:08-20

A snippet of my document and my code are as follows:

import xml.etree.ElementTree as ET
obj = ET.fromstring("""
   <tab>
    <infos><bounds left="7947" top="88607" width="10086" height="1184" bottom="89790" right="18032" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/>     <prtBounds left="115" top="0" width="9300" height="1169" bottom="1168" right="9414"/> </infos>
    <row > <infos> <bounds left="8062" top="88607" width="9300" height="524" bottom="89130" right="17361" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/>      <prtBounds left="0" top="0" width="9300" height="524" bottom="523" right="9299"/>      </infos>
     <cell ptr="000002232E644270" id="199" symbol="class SwCellFrame" next="202" upper="198" lower="200" rowspan="1"> <infos> <bounds left="8062" top="88607" width="546" height="524" bottom="89130" right="8607" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/>        <prtBounds left="7" top="15" width="532" height="509" bottom="523" right="538"/>  </infos>
      <txt> <infos> <bounds left="8069" top="88622" width="532" height="187" bottom="88808" right="8600" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="3" width="532" height="184" bottom="186" right="531"/>        </infos>
       <Finish/>
      </txt>
      <txt> <infos> <bounds left="8069" top="88809" width="532" height="149" bottom="88957" right="8600" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="136" top="0" width="396" height="149" bottom="148" right="531"/> </infos>
UDA       <Finish/>
      </txt>
     </cell>
     <cell ptr="000002232E642E40" id="202" symbol="class SwCellFrame" next="205" prev="199" upper="198" lower="203" rowspan="1"> <infos> <bounds left="8608" top="88607" width="3283" height="524" bottom="89130" right="11890" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/> <prtBounds left="7" top="15" width="3269" height="509" bottom="523" right="3275"/> </infos>
      <txt>
       <infos> <bounds left="8615" top="88622" width="3269" height="180" bottom="88801" right="11883" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="7" width="3269" height="173" bottom="179" right="3268"/> </infos> <Finish/>
      </txt>
      <txt> <infos> <bounds left="8615" top="88802" width="3269" height="149" bottom="88950" right="11883" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="58" top="0" width="3170" height="149" bottom="148" right="3227"/> </infos>
Nombre       <Finish/>
      </txt>
     </cell>
    </row>
  </tab>
""")
a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    print(i, item.text.strip())

But if I simplify the document, I do manage to extract the text:

obj = ET.fromstring("""
   <tab>
    <row>
     <cell > 
      <txt > <Finish/> </txt>
      <txt > UDA <Finish/> </txt>
     </cell>
     <cell >
      <txt > <Finish/> </txt>
      <txt > Nombre       <Finish/> </txt>
     </cell>
   </row>
  </tab>
""")

a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    print(i, item.text.strip())

Output:

0 
1 UDA
2 
3 Nombre

I don't know how to solve this problem, because my working document is very large and I can't simplify it as I have done in this example.

CodePudding user response:

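In the full document, the "UDA" and "Nombre" strings come after the closing </infos> tag, so ElementTree stores them in the tail of the infos element, not in the text of txt. item.text only holds the text that appears before the first child element, which here is just whitespace. The easiest way to get the desired output is to use itertext(), which yields every text fragment inside txt, including the tails of its children: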

a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    text = "".join([s.strip() for s in item.itertext()])
    print(i, text)
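
To see where the text actually ends up, here is a minimal sketch using a cut-down version of one txt element. The reduced XML string and the helper name txt_content are made up for illustration; only find(), .text, .tail and itertext() come from ElementTree itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced version of one <txt> element from the question:
# the visible text sits between </infos> and <Finish/>.
txt = ET.fromstring("""<txt> <infos><bounds left="8069"/></infos>
UDA       <Finish/>
</txt>""")

infos = txt.find('infos')

# .text is only the text before the first child, here just whitespace.
print(repr(txt.text))

# .tail is the text that follows an element's closing tag, here "UDA" plus whitespace.
print(repr(infos.tail))


def txt_content(elem):
    """Join all text fragments inside elem, stripping whitespace from each one."""
    return "".join(s.strip() for s in elem.itertext())


print(txt_content(txt))
```

If only the text right after infos is wanted, reading infos.tail directly also works, but itertext() is more robust when the position of the text inside txt varies across a large document.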