What would be the best way to construct a DF from the below nested XML data?
Each "properties" element has three "property" elements nested containing the "name" and "value" of our data. I tried doing two for loops, pandas read_xml option, and a few other pieces but haven't quite gotten the nested logic figured out. My current approach below is closer, but does not keep the names and values together.
Using Python 3.7 in Jupyter on windows
Sample XML Data:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<env:Header
xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<wsa:Action>RetrieveResponse</wsa:Action>
<wsa:MessageID>urn:uuid:1234</wsa:MessageID>
<wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
<wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
<wsse:Security>
<wsu:Timestamp wsu:Id="Timestamp-45333">
<wsu:Created>2022-11-07T17:02:44Z</wsu:Created>
<wsu:Expires>2022-11-07T17:07:44Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</env:Header>
<soap:Body>
<RetrieveResponseMsg
xmlns="http://exacttarget.com/wsdl/partnerAPI">
<OverallStatus>MoreDataAvailable</OverallStatus>
<RequestID>asdfds455</RequestID>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>asdfdfd12</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>asdf</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>fasdsa</Value>
</Property>
</Properties>
</Results>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>fasd123</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>asdfd</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>a0A4f</Value>
</Property>
</Properties>
</Results>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>0034P00</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>fasdfs</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>a0fasd</Value>
</Property>
</Properties>
</Results>
</RetrieveResponseMsg>
</soap:Body>
</soap:Envelope>
What I've Attempted So Far:
data_output = []
for el in soup_de.find_all('Property'):
dict_ = {el.find('Name').text:el.find('Value').text}
data_output.append(dict_)
print(len(data_output))
# print(data_output)
testing_de_df = pd.DataFrame(data_output)
display(testing_de_df.info())
display(testing_de_df.head(25))
Desired Output:
details = {'FIELD_NAME': ['asdfdfd12', 'fasd123', '0034P00'],
'FIELD_NAME_2': ['asdf', 'asdfd', 'fasdfs'],
'FIELD_NAME_3': ['fasdsa', 'a0A4f', 'a0fasd']}
desired_output = pd.DataFrame(details)
print(desired_output)
CodePudding user response:
Since <Property>
sits at a shallow part of the XML, simply call pandas.read_xml
narrowing in on that set of nodes while acknowledging the default namespace (http://exacttarget.com/wsdl/partnerAPI
):
property_df = pd.read_xml(
"Input.xml",
xpath = ".//rrm:Property",
namespaces = {"rrm": "http://exacttarget.com/wsdl/partnerAPI"}
)
print(property_df)
# Name Value
# 0 FIELD_NAME asdfdfd12
# 1 FIELD_NAME_2 asdf
# 2 FIELD_NAME_3 fasdsa
# 3 FIELD_NAME fasd123
# 4 FIELD_NAME_2 asdfd
# 5 FIELD_NAME_3 a0A4f
# 6 FIELD_NAME 0034P00
# 7 FIELD_NAME_2 fasdfs
# 8 FIELD_NAME_3 a0fasd
To delineate by property, consider creating a property group number with groupby().cumcount()
and reshaping data wide with pivot_table
:
property_wide_df = (
property_df
.assign(property_no = lambda x: x.groupby("Name").cumcount().add(1))
.pivot_table(index="property_no", columns="Name", values="Value", aggfunc="sum")
)
print(property_wide_df)
# Name FIELD_NAME FIELD_NAME_2 FIELD_NAME_3
# property_no
# 1 asdfdfd12 asdf fasdsa
# 2 fasd123 asdfd a0A4f
# 3 0034P00 fasdfs a0fasd