challenge in extracting data from a XML response of a web api using beautiful soup
need to loop through the entire xml response and fetch data from different tags and store into a data frame
below mentioned are the type of values that need to be extracted form xml and loop through the entire xml and load into data frame .
find the Ref value from
<Value ref="52f3623a-497c0b0a154b">
Org value from
<UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias>
( there are two tags with same name UniqueAlias and even not in the same order)Value from the hierarchy tag
<Hierarchy><![CDATA[Guide]]></Hierarchy>
Value from the additional field tag where attribute label = "country"
<AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField>
Value from the additional field tag where attribute label = "PrStatus"
<AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField>
XML Sample format :
<Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>
Python Code :
text= response.content
soup = BeautifulSoup(text, "html.parser")
data = []
for value in soup.valuelist.find_all('value'):
additional_fields = [field.text for field in soup.find_all('additionalfield')]
data.append([
value['ref'],
value.uniquealias.text,
value.hierarchy.text,
additional_fields[1],
additional_fields[0],
additional_fields[2],
])
df = pd.DataFrame(data, columns=['ID', 'Status', 'Name', 'Country', 'org','ST'])
print(df)
Thanks in advance
CodePudding user response:
'pandas' 1.3 has a read_xml()
function, that you could try to use. You'll have to play around with the parameters, as the data returned doesnt seem to be what you are after. There may be some nested nodes:
import pandas as pd
text = '''<valueList><Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>'''
df = pd.read_xml(text)
Output:
print(df)
ref DisplayName ... Hierarchy AdditionalField
0 52f3623a-497c0b0a154b Guide ... Guide DActive
1 4b0444e0-43137db45c1a Guide 3 ... Guide 3 Active
2 4d43bb96-c6b0ad9709ec Salmon ... Salmon DActive
3 1c0d6493-8f63c9043b5f Mini comp ... Mini comp Active
OPTION 2:
You can use your code, as it's fine. But would need to change a few things in there:
import pandas as pd
from bs4 import BeautifulSoup
text = '''<Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>'''
soup = BeautifulSoup(text, "html.parser")
data = []
for value in soup.find_all('value'):
try:
vId = value['ref']
except:
vId = 'N/A'
print('id not found')
try:
status = value.uniquealias.text
except:
status = 'N/A'
print('status not found')
try:
name = value.hierarchy.text
except:
name = 'N/A'
print('name not found')
try:
country = value.find('additionalfield', {'label':'Country'}).text
except:
country = 'N/A'
print('country not found')
try:
org = value.find('additionalfield', {'label':'Organisation'}).text
except:
org = 'N/A'
print('org not found')
try:
st = value.find('additionalfield', {'label':'PrStatus'}).text
except:
st = 'N/A'
print('st not found')
row = {
'ID':vId,
'Status':status,
'Name':name,
'Country':country,
'org':org,
'ST':st}
data.append(row)
df = pd.DataFrame(data)
print(df)
Output:
ID Status ... org ST
0 52f3623a-497c0b0a154b STATUS=Active ... ABCD/I DActive
1 4b0444e0-43137db45c1a ORG=EFG/C|Lim ... EFG/C Active
2 4d43bb96-c6b0ad9709ec STATUS=Active ... LJK/N|Yuly DActive
3 1c0d6493-8f63c9043b5f STATUS=Active ... xyz/C Active