Extract XML data within CDATA using Python

Time:07-26

I have a requirement to extract XML that is embedded in a CDATA section inside another XML document. I am able to extract the regular XML tags, but not the XML tags inside the CDATA.

I need to extract:

  1. EventId = 122157660 (I am able to do this).
  2. _Type="Phone" _Value="5152083348" within PAYLOAD/REQUEST_GROUP/REQUESTING_PARTY/CONTACT_DETAIL/CONTACT_POINT (I need help with this).

Below is the XML sample I am working with.

<B2B_DATA>
   <B2B_METADATA>
       <EventId>122157660</EventId>
       <MessageType>Request</MessageType>
   </B2B_METADATA>
<PAYLOAD>
    <![CDATA[<?xml version="1.0"?>
        <REQUEST_GROUP MISMOVersionID="1.1.1">
            <REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St" _City="rockwall" _State="MD" _PostalCode="11311" _Identifier="416">
                <CONTACT_DETAIL _Name="XX Davis">
                    <CONTACT_POINT _Type="Phone" _Value="1236573348"/>
                    <CONTACT_POINT _Type="Email" _Value="[email protected]"/>
                </CONTACT_DETAIL>
            </REQUESTING_PARTY>
        </REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>

I have tried this:

from xml.etree import ElementTree

tree = ElementTree.parse('file.xml')
root = tree.getroot()
# Print the tag of each direct child of the root element
for child in root:
    print(child.tag)

Output:

B2B_METADATA
PAYLOAD

I am not able to parse inside PAYLOAD.

Any help is greatly appreciated.

Answer:

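The parser treats everything inside a CDATA section as plain text, which is why the inner tags never show up as child elements of PAYLOAD. What you need to do in this case is parse the outer XML, extract the CDATA content as a string, parse that string as a second XML document, and extract the target data from it.

I personally would use lxml and XPath, not ElementTree: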

from lxml import etree

# Parse the outer document
root = etree.parse('file.xml')

# Step 1: extract the CDATA content as a string
cd = root.xpath('//PAYLOAD//text()')[0].strip()

# Step 2: parse the CDATA string as XML
doc = etree.XML(cd)

# Step 3: extract the target data
phone = doc.xpath('//REQUESTING_PARTY//CONTACT_POINT[@_Type="Phone"]/@_Value')[0]
print(phone)

Output, based on your sample XML above:

1236573348
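
For comparison, the same two-stage parse can be sketched with the standard library's ElementTree, which is what the question started from. This is only a minimal sketch: the function name `extract_event_and_phone` and the constant `OUTER_XML` are hypothetical, and the sample is a trimmed copy of the XML from the question, inlined as a string instead of being read from file.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, inlined so the sketch
# runs without file.xml (an assumption made for illustration).
OUTER_XML = """<B2B_DATA>
   <B2B_METADATA>
       <EventId>122157660</EventId>
       <MessageType>Request</MessageType>
   </B2B_METADATA>
<PAYLOAD>
    <![CDATA[<?xml version="1.0"?>
        <REQUEST_GROUP MISMOVersionID="1.1.1">
            <REQUESTING_PARTY _Name="CityBank" _Identifier="416">
                <CONTACT_DETAIL _Name="XX Davis">
                    <CONTACT_POINT _Type="Phone" _Value="1236573348"/>
                </CONTACT_DETAIL>
            </REQUESTING_PARTY>
        </REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>"""


def extract_event_and_phone(outer_xml):
    """Return (EventId, phone number) from the outer document string."""
    # Stage 1: parse the outer document; the CDATA content is exposed
    # as the plain text of the PAYLOAD element.
    outer = ET.fromstring(outer_xml)
    event_id = outer.findtext("B2B_METADATA/EventId")
    inner_text = outer.findtext("PAYLOAD").strip()

    # Stage 2: parse the extracted text as its own XML document.
    # Its root element is REQUEST_GROUP, so paths are relative to it.
    inner = ET.fromstring(inner_text)
    point = inner.find(
        "REQUESTING_PARTY/CONTACT_DETAIL/CONTACT_POINT[@_Type='Phone']"
    )
    phone = point.get("_Value") if point is not None else None
    return event_id, phone


event_id, phone = extract_event_and_phone(OUTER_XML)
print(event_id, phone)
```

The strip() call matters here because the XML declaration inside the CDATA must be at the very start of the string being parsed; leading whitespace before it would make the second parse fail.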