I have a requirement to extract XML that sits inside a CDATA section within an outer XML document. I am able to extract the regular XML tags, but not the XML tags inside the CDATA.
I need to extract
- EventId = 122157660 (I am already able to do this).
- _Type="Phone" _Value="5152083348" within PAYLOAD/REQUEST_GROUP/REQUESTING_PARTY/CONTACT_DETAIL/CONTACT_POINT (I need help with this).
Below is the XML sample I am working with.
<B2B_DATA>
<B2B_METADATA>
<EventId>122157660</EventId>
<MessageType>Request</MessageType>
</B2B_METADATA>
<PAYLOAD>
<![CDATA[<?xml version="1.0"?>
<REQUEST_GROUP MISMOVersionID="1.1.1">
<REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St" _City="rockwall" _State="MD" _PostalCode="11311" _Identifier="416">
<CONTACT_DETAIL _Name="XX Davis">
<CONTACT_POINT _Type="Phone" _Value="1236573348"/>
<CONTACT_POINT _Type="Email" _Value="[email protected]"/>
</CONTACT_DETAIL>
</REQUESTING_PARTY>
</REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>
I have tried this:
from xml.etree import ElementTree
tree = ElementTree.parse('file.xml')
root = tree.getroot()
# Print the tag of each direct child of the root element
for child in root:
    print(child.tag)
Output:
B2B_METADATA
PAYLOAD
I am not able to parse anything inside PAYLOAD.
Any help is greatly appreciated.
CodePudding user response:
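The parser treats everything inside a CDATA section as plain text, not as child elements, which is why PAYLOAD appears to have no children. What you need to do, in this case, is parse the outer XML, extract the CDATA content as a string, parse that string as a second (inner) XML document, and extract the target data from that.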
I personally would use lxml and xpath, not ElementTree:
from lxml import etree
root = etree.parse('file.xml')
# Step 1: extract the CDATA content as a string
cd = root.xpath('//PAYLOAD//text()')[0].strip()
# Step 2: parse the CDATA string as XML
doc = etree.XML(cd)
# Step 3: extract the target data
doc.xpath('//REQUESTING_PARTY//CONTACT_POINT[@_Type="Phone"]/@_Value')[0]
Output, based on your sample xml above:
'1236573348'
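If you would rather stay with the standard library, the same two-step idea should also work with ElementTree, since it exposes the CDATA content as the text of PAYLOAD. The following is only a minimal sketch: the function name extract_event_and_phone and the OUTER_XML constant are made up for illustration, and the sample is a trimmed copy of the XML from the question, inlined so the snippet runs without file.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question (illustrative, inlined for convenience).
OUTER_XML = """<B2B_DATA>
<B2B_METADATA>
<EventId>122157660</EventId>
<MessageType>Request</MessageType>
</B2B_METADATA>
<PAYLOAD>
<![CDATA[<?xml version="1.0"?>
<REQUEST_GROUP MISMOVersionID="1.1.1">
<REQUESTING_PARTY _Name="CityBank">
<CONTACT_DETAIL _Name="XX Davis">
<CONTACT_POINT _Type="Phone" _Value="1236573348"/>
</CONTACT_DETAIL>
</REQUESTING_PARTY>
</REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>"""


def extract_event_and_phone(outer_xml):
    """Hypothetical helper: return (EventId, phone number) from the outer XML string."""
    outer = ET.fromstring(outer_xml)
    event_id = outer.findtext("B2B_METADATA/EventId")

    # The CDATA content arrives as plain text on PAYLOAD; strip the surrounding
    # newlines so the inner XML declaration is at the very start of the string.
    inner_text = outer.findtext("PAYLOAD").strip()

    # Parse the extracted text as a second, independent XML document.
    inner = ET.fromstring(inner_text)

    # The inner root is REQUEST_GROUP, so the path is relative to it.
    point = inner.find(
        "REQUESTING_PARTY/CONTACT_DETAIL/CONTACT_POINT[@_Type='Phone']"
    )
    phone = point.get("_Value") if point is not None else None
    return event_id, phone


event_id, phone = extract_event_and_phone(OUTER_XML)
print(event_id, phone)
```

For the real file, you could pass the file contents to the helper, or swap ET.fromstring(outer_xml) for ET.parse('file.xml').getroot().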