A snippet of my document and the code is as follows:
import xml.etree.ElementTree as ET
obj = ET.fromstring("""
<tab>
<infos><bounds left="7947" top="88607" width="10086" height="1184" bottom="89790" right="18032" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/> <prtBounds left="115" top="0" width="9300" height="1169" bottom="1168" right="9414"/> </infos>
<row > <infos> <bounds left="8062" top="88607" width="9300" height="524" bottom="89130" right="17361" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="0" width="9300" height="524" bottom="523" right="9299"/> </infos>
<cell ptr="000002232E644270" id="199" symbol="class SwCellFrame" next="202" upper="198" lower="200" rowspan="1"> <infos> <bounds left="8062" top="88607" width="546" height="524" bottom="89130" right="8607" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/> <prtBounds left="7" top="15" width="532" height="509" bottom="523" right="538"/> </infos>
<txt> <infos> <bounds left="8069" top="88622" width="532" height="187" bottom="88808" right="8600" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="3" width="532" height="184" bottom="186" right="531"/> </infos>
<Finish/>
</txt>
<txt> <infos> <bounds left="8069" top="88809" width="532" height="149" bottom="88957" right="8600" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="136" top="0" width="396" height="149" bottom="148" right="531"/> </infos>
UDA <Finish/>
</txt>
</cell>
<cell ptr="000002232E642E40" id="202" symbol="class SwCellFrame" next="205" prev="199" upper="198" lower="203" rowspan="1"> <infos> <bounds left="8608" top="88607" width="3283" height="524" bottom="89130" right="11890" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/> <prtBounds left="7" top="15" width="3269" height="509" bottom="523" right="3275"/> </infos>
<txt>
<infos> <bounds left="8615" top="88622" width="3269" height="180" bottom="88801" right="11883" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="7" width="3269" height="173" bottom="179" right="3268"/> </infos> <Finish/>
</txt>
<txt> <infos> <bounds left="8615" top="88802" width="3269" height="149" bottom="88950" right="11883" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="58" top="0" width="3170" height="149" bottom="148" right="3227"/> </infos>
Nombre <Finish/>
</txt>
</cell>
</row>
</tab>
""")
a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    print(i, item.text.strip())
With the full document, this prints only the indices followed by empty strings. But if I simplify the document, I do manage to extract the text:
obj = ET.fromstring("""
<tab>
<row>
<cell >
<txt > <Finish/> </txt>
<txt > UDA <Finish/> </txt>
</cell>
<cell >
<txt > <Finish/> </txt>
<txt > Nombre <Finish/> </txt>
</cell>
</row>
</tab>
""")
a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    print(i, item.text.strip())
0
1 UDA
2
3 Nombre
I don't know how to solve this problem, because my working document is very large and I can't simplify it as I have done in this example.
Answer:
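The "UDA" and "Nombre" strings are stored in the tail of the infos elements, not in the text of the txt elements. In ElementTree, an element's text covers only the characters before its first child; anything that follows a child's closing tag belongs to that child's tail. In the full document every txt starts with an infos child, so item.text is just whitespace. The easiest way to get the wanted output is to use itertext(), which yields the text and tail strings of the element and all of its descendants: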
a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    # Join every text/tail fragment found inside this txt element
    text = "".join([s.strip() for s in item.itertext()])
    print(i, text)
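To see where the strings actually live, here is a minimal sketch that reads the tail attributes directly instead of using itertext(). The reduced document and the helper name direct_text are assumptions made for illustration; they are not part of the original question.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical stand-in for the question's document:
# each <txt> starts with an <infos> child, as in the full file.
doc = ET.fromstring(
    "<tab><row><cell>"
    '<txt> <infos> <bounds left="1"/> </infos>\n<Finish/>\n</txt>'
    '<txt> <infos> <bounds left="2"/> </infos>\nUDA <Finish/>\n</txt>'
    "</cell></row></tab>"
)


def direct_text(elem):
    """Return elem.text plus the tails of its direct children, stripped."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts).strip()


txt_elements = doc.findall("./row/cell/txt")

# item.text alone is only the whitespace before <infos>
plain = [txt.text.strip() for txt in txt_elements]

# text + child tails recovers the cell content
results = [direct_text(txt) for txt in txt_elements]

for i, text in enumerate(results):
    print(i, text)
```

Unlike itertext(), this helper looks only at direct children, so it would miss text nested deeper inside infos or other descendants. For the documents shown in the question both approaches should give the same strings.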