I wrote a Python program that uses the lxml library to parse an XML file with an XPath expression. The XPath is correct and the right element is found, but the returned text contains many '\n' characters and spaces, matching the XML file's formatting.
Here is my code:
from lxml import etree
from xml.dom import minidom
#data = minidom.parse('D:/LocalSpark/bitmap.xml')
sigxml = etree.parse('D:/LocalSpark/bitmap.xml',etree.XMLParser(remove_blank_text=True, load_dtd=True))
xpath = '/OneMessage[@Name="NR RRCReconfiguration"]/BalongMessage/Content/L3MessageContent/DL-DCCH-Message/message/c1/rrcReconfiguration/criticalExtensions/rrcReconfiguration/measConfig/measObjectToAddModList/MeasObjectToAddMod/measObject/measObjectNR/referenceSignalConfig/ssb-ConfigMobility/ssb-ToMeasure/setup/mediumBitmap'
info = 10000000
for node in sigxml.xpath(xpath):
    print('node: ', node)
    print('node.tag: ',node.tag)
    print('node.text:',node.text)
    print('node.item:',node.items())
    print('node.attrib:',node.attrib)
    if info == node.text:
        print("%s info do exist!"%info)
    else:
        print("%s info do not exist!!!"%info)
Here is the XML file:
<OneMessage Name="NR RRCReconfiguration" MsgTimeStamp="1668594368290"><BalongMessage><Header><usRsvd>4608</usRsvd><ucbMdmId>0</ucbMdmId><ucbMsgType>3</ucbMsgType><ucbRsvd>0</ucbRsvd><ulMsgClsID>26080000</ulMsgClsID><ullbTimeStamp>1853637.763054</ullbTimeStamp><ullbCpuTransID>38693</ullbCpuTransID><usSocpTransID>20388</usSocpTransID><ullLocalTime>133129368818699187</ullLocalTime><ulTransNo>6107</ulTransNo><ulSendPID>131072</ulSendPID><ulRecvPID>0</ulRecvPID><ulPrimID>00000003</ulPrimID><ucbOtaDirect>DL(1)</ucbOtaDirect><ucbPrintLevel>63</ucbPrintLevel><ulDataSize>56</ulDataSize></Header><Content><L3MessageContent><DL-DCCH-Message>
<message>
<c1>
<rrcReconfiguration>
<criticalExtensions>
<rrcReconfiguration>
<measConfig>
<measObjectToAddModList>
<MeasObjectToAddMod>
<measObject>
<measObjectNR>
<referenceSignalConfig>
<ssb-ConfigMobility>
<ssb-ToMeasure>
<setup>
<mediumBitmap>
10000000
</mediumBitmap>
</setup>
</ssb-ToMeasure>
</ssb-ConfigMobility>
</referenceSignalConfig>
</measObjectNR>
</measObject>
</MeasObjectToAddMod>
</measObjectToAddModList>
</measConfig>
</rrcReconfiguration>
</criticalExtensions>
</rrcReconfiguration>
</c1>
</message>
</DL-DCCH-Message>
</L3MessageContent></Content></BalongMessage></OneMessage>
Here is the result:
node: <Element mediumBitmap at 0x22e3c645f80>
node.tag: mediumBitmap
node.text:
10000000
node.item: []
node.attrib: {}
10000000 info do not exist!!!
My problem is that the code clearly finds the mediumBitmap element, but, as the XML file shows, the value has a newline and indentation before and after it. So the program reports that mediumBitmap's text value is
\n 10000000 \n
rather than just 10000000.
It is a standard XML file from a project, so I can't edit it.
I tried adding remove_blank_text=True
to the parser, and I also tried minidom.
Both failed.
CodePudding user response:
There are many ways to strip spaces and newlines. One simple technique is to remove them with a regular expression.
The critical line is this one:
int(re.sub(r'\s+', '', node.text))
It replaces every run of whitespace in node.text (spaces, tabs, and newlines) with the empty string
''
and then casts the result to int
so that it can be compared with the info
variable, which is an integer. Note that \s already matches newlines, so the pattern does not need a separate \n.
Here is the code:
from lxml import etree
import re
sigxml = etree.parse('D:/LocalSpark/bitmap.xml',etree.XMLParser(remove_blank_text=True, load_dtd=True))
xpath = '/OneMessage[@Name="NR RRCReconfiguration"]/BalongMessage/Content/L3MessageContent/DL-DCCH-Message/message/c1/rrcReconfiguration/criticalExtensions/rrcReconfiguration/measConfig/measObjectToAddModList/MeasObjectToAddMod/measObject/measObjectNR/referenceSignalConfig/ssb-ConfigMobility/ssb-ToMeasure/setup/mediumBitmap'
info = 10000000
for node in sigxml.xpath(xpath):
    print('node: ', node)
    print('node.tag: ',node.tag)
    print('node.text:',node.text)
    print('node.item:',node.items())
    print('node.attrib:',node.attrib)
    # Remove all whitespace from the text, then compare as integers
    if info == int(re.sub(r'\s+', '', node.text)):
        print("%s info do exist!"%info)
    else:
        print("%s info do not exist!!!"%info)
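If the value only has whitespace around it, as in the question, a plain str.strip() is probably enough and avoids the regex. Below is a minimal sketch of that idea, not a drop-in replacement. It uses the standard library's xml.etree.ElementTree so it runs without lxml installed (lxml elements expose the same .text attribute), and the SAMPLE string and the text_matches helper are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down sample that mimics the formatting of bitmap.xml
SAMPLE = """<setup>
    <mediumBitmap>
        10000000
    </mediumBitmap>
</setup>"""


def text_matches(node, expected):
    """Compare an element's text, with surrounding whitespace trimmed,
    against an expected value (compared as strings)."""
    text = (node.text or "").strip()  # node.text is None for empty elements
    return text == str(expected)


root = ET.fromstring(SAMPLE)
node = root.find("mediumBitmap")
info = 10000000

print(repr(node.text))           # raw text still carries the newlines and indentation
print(text_matches(node, info))  # True
```

Unlike the regex, strip() only trims the ends, so any whitespace inside the value is left alone. With lxml you could also do the trimming in the query itself by wrapping the path in the XPath 1.0 function normalize-space(), which returns the trimmed string directly.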