I'm receiving the below XML response from an API call and am looking to iterate through the "Results" and store all of the data points as a pandas dataframe.
I was successfully able to grab my data points of interest by chaining .find() methods shown below, but don't know how to loop through all of the Results block within the body given the structure of the XML response.
I am using Python 3.7 in Jupyter on Windows.
What I've Tried:
import pandas as pd
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
soup = BeautifulSoup(soap_response.text, "xml")
# print(soup.prettify())
objectid_field = soup.find('Results').find('ObjectID').text
customerkey_field = soup.find('Results').find('CustomerKey').text
name_field = soup.find('Results').find('Name').text
issendable_field = name_field = soup.find('Results').find('IsSendable').text
sendablesubscribe_field = soup.find('Results').find('SendableSubscriberField').text
# for de in soup:
# de_name = soup.find('Results').find('Name').text
# print(de_name)
# test_df = pd.read_xml(soup,
# xpath="//Results",
# namespaces={""})
Sample XML Data Structure:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2003/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema"
xmlns:xsd="http://www.w3.org/XMLSchema"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-201-wss-wssecurity-secext-1.0.xsd"
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-201-wss-security-1.0.xsd">
<env:Header
xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<wsa:Action>RetrieveResponse</wsa:Action>
<wsa:MessageID>urn:uuid:1234</wsa:MessageID>
<wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
<wsa:To>http://schemas.xmlsoap.org/ws/2004/08/dressing/role/anonymous</wsa:To>
<wsse:Security>
<wsu:Timestamp wsu:Id="Timestamp-1234">
<wsu:Created>2021-11-07T13:10:54Z</wsu:Created>
<wsu:Expires>2021-11-07T13:15:54Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</env:Header>
<soap:Body>
<RetrieveResponseMsg
xmlns="http://partnerAPI">
<OverallStatus>OK</OverallStatus>
<RequestID>f9876</RequestID>
<Results xsi:type="Data">
<PartnerKey xsi:nil="true" />
<ObjectID>Object1</ObjectID>
<CustomerKey>Customer1</CustomerKey>
<Name>Test1</Name>
<IsSendable>true</IsSendable>
<SendableSubscriberField>
<Name>_Something1</Name>
</SendableSubscriberField>
</Results>
<Results xsi:type="Data">
<PartnerKey xsi:nil="true" />
<ObjectID>Object2</ObjectID>
<CustomerKey>Customer2</CustomerKey>
<Name>Name2</Name>
<IsSendable>true</IsSendable>
<SendableSubscriberField>
<Name>_Something2</Name>
</SendableSubscriberField>
</Results>
<Results xsi:type="Data">
<PartnerKey xsi:nil="true" />
<ObjectID>Object3</ObjectID>
<CustomerKey>AnotherKey</CustomerKey>
<Name>Something3</Name>
<IsSendable>false</IsSendable>
</Results>
</RetrieveResponseMsg>
</soap:Body>
</soap:Envelope>'
CodePudding user response:
You're super close, you need to find all of the Results
tags, then iterate over them, last grabbing the elements you want:
for el in soup.find_all('Results'):
objectid_field = el.find('ObjectID').text
customerkey_field = el.find('CustomerKey').text
name_field = el.find('Name').text
issendable_field = name_field = el.find('IsSendable').text
sendablesubscribe_field = el.find('SendableSubscriberField').text
However, SendableSubscriberField
isn't always there, so you might need to check if sendable
is True first:
for el in soup.find_all('Results'):
objectid_field = el.find('ObjectID').text
customerkey_field = el.find('CustomerKey').text
name_field = el.find('Name').text
issendable_field = el.find('IsSendable').text
# skip if not sendable
if issendable_field == 'false':
sendablesubscribe_field = None
continue
sendablesubscribe_field = el.find('SendableSubscriberField').find('Name').text
Edit: Constructing the dataframe
To build the dataframe from this, I'd collect everything into a list
of dictionaries:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(...)
data = []
for el in soup.find_all('Results'):
record = {}
record['ObjectID'] = el.find('ObjectID').text
record['CustomerKey'] = el.find('CustomerKey').text
record['Name'] = el.find('Name').text
record['IsSendable'] = el.find('IsSendable').text
# skip if not sendable
if record['IsSendable'] == 'false':
record['SendableSubscriberField'] = None
continue
record['SendableSubscriberField'] = el.find('SendableSubscriberField').find('Name').text
data.append(record)
df = pd.DataFrame(data)
CodePudding user response:
Reconsider use of pandas.read_xml
by acknowledging the default namespace (http://partnerAPI
). Also, since you need a lower-level value, run read_xml
twice and join
the results. Notice all attribute and element values are returned even if missing.
soap_df = (
pd.read_xml(
soap_response.text,
xpath = ".//rrm:RetrieveResponseMsg/rrm:Results",
namespaces = {"rrm": "http://partnerAPI"}
).join(
pd.read_xml(
soap_response.text,
xpath = ".//rrm:RetrieveResponseMsg/rrm:Results/rrm:SendableSubscriberField",
namespaces = {"rrm": "http://partnerAPI"},
names = ["SendableSubscriberField_Name", ""]
),
)
)
print(soap_df)
# type PartnerKey ObjectID CustomerKey Name IsSendable SendableSubscriberField SendableSubscriberField_Name
# 0 Data NaN Object1 Customer1 Test1 True NaN _Something1
# 1 Data NaN Object2 Customer2 Name2 True NaN _Something2
# 2 Data NaN Object3 AnotherKey Something3 False NaN NaN