Home > Software design >  Python - Construct DF From Nested XML Response
Python - Construct DF From Nested XML Response

Time:11-08

What would be the best way to construct a DF from the below nested XML data?

Each "properties" element has three "property" elements nested containing the "name" and "value" of our data. I tried doing two for loops, pandas read_xml option, and a few other pieces but haven't quite gotten the nested logic figured out. My current approach below is closer, but does not keep the names and values together.

Using Python 3.7 in Jupyter on windows

Sample XML Data:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
    xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
    xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
    <env:Header
        xmlns:env="http://www.w3.org/2003/05/soap-envelope">
        <wsa:Action>RetrieveResponse</wsa:Action>
        <wsa:MessageID>urn:uuid:1234</wsa:MessageID>
        <wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
        <wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
        <wsse:Security>
            <wsu:Timestamp wsu:Id="Timestamp-45333">
                <wsu:Created>2022-11-07T17:02:44Z</wsu:Created>
                <wsu:Expires>2022-11-07T17:07:44Z</wsu:Expires>
            </wsu:Timestamp>
        </wsse:Security>
    </env:Header>
    <soap:Body>
        <RetrieveResponseMsg
            xmlns="http://exacttarget.com/wsdl/partnerAPI">
            <OverallStatus>MoreDataAvailable</OverallStatus>
            <RequestID>asdfds455</RequestID>
            <Results xsi:type="DataExtensionObject">
                <PartnerKey xsi:nil="true" />
                <ObjectID xsi:nil="true" />
                <Type>DataExtensionObject</Type>
                <Properties>
                    <Property>
                        <Name>FIELD_NAME</Name>
                        <Value>asdfdfd12</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_2</Name>
                        <Value>asdf</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_3</Name>
                        <Value>fasdsa</Value>
                    </Property>
                </Properties>
            </Results>
            <Results xsi:type="DataExtensionObject">
                <PartnerKey xsi:nil="true" />
                <ObjectID xsi:nil="true" />
                <Type>DataExtensionObject</Type>
                <Properties>
                    <Property>
                        <Name>FIELD_NAME</Name>
                        <Value>fasd123</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_2</Name>
                        <Value>asdfd</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_3</Name>
                        <Value>a0A4f</Value>
                    </Property>
                </Properties>
            </Results>
            <Results xsi:type="DataExtensionObject">
                <PartnerKey xsi:nil="true" />
                <ObjectID xsi:nil="true" />
                <Type>DataExtensionObject</Type>
                <Properties>
                    <Property>
                        <Name>FIELD_NAME</Name>
                        <Value>0034P00</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_2</Name>
                        <Value>fasdfs</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_3</Name>
                        <Value>a0fasd</Value>
                    </Property>
                </Properties>
            </Results>
        </RetrieveResponseMsg>
    </soap:Body>
</soap:Envelope>

What I've Attempted So Far:

data_output = []
for el in soup_de.find_all('Property'):
    
    dict_ = {el.find('Name').text:el.find('Value').text}

    data_output.append(dict_)
    

print(len(data_output))
# print(data_output)


testing_de_df = pd.DataFrame(data_output)
    
    
display(testing_de_df.info())
display(testing_de_df.head(25))

Desired Output:

details = {'FIELD_NAME': ['asdfdfd12', 'fasd123', '0034P00'],
           'FIELD_NAME_2': ['asdf', 'asdfd', 'fasdfs'],
           'FIELD_NAME_3': ['fasdsa', 'a0A4f', 'a0fasd']}

desired_output = pd.DataFrame(details)

print(desired_output)

CodePudding user response:

Since <Property> sits at a shallow part of the XML, simply call pandas.read_xml narrowing in on that set of nodes while acknowledging the default namespace (http://exacttarget.com/wsdl/partnerAPI):

property_df = pd.read_xml(
    "Input.xml", 
    xpath = ".//rrm:Property",
    namespaces = {"rrm": "http://exacttarget.com/wsdl/partnerAPI"}
)

print(property_df)
#            Name      Value
# 0    FIELD_NAME  asdfdfd12
# 1  FIELD_NAME_2       asdf
# 2  FIELD_NAME_3     fasdsa
# 3    FIELD_NAME    fasd123
# 4  FIELD_NAME_2      asdfd
# 5  FIELD_NAME_3      a0A4f
# 6    FIELD_NAME    0034P00
# 7  FIELD_NAME_2     fasdfs
# 8  FIELD_NAME_3     a0fasd

To delineate by property, consider creating a property group number with groupby().cumcount() and reshaping data wide with pivot_table:

property_wide_df = (
    property_df
      .assign(property_no = lambda x: x.groupby("Name").cumcount().add(1))
      .pivot_table(index="property_no", columns="Name", values="Value", aggfunc="sum")
)

print(property_wide_df)
# Name        FIELD_NAME FIELD_NAME_2 FIELD_NAME_3
# property_no
# 1            asdfdfd12         asdf       fasdsa
# 2              fasd123        asdfd        a0A4f
# 3              0034P00       fasdfs       a0fasd
  • Related