Home > Software engineering >  Extracting data from xml with similar tag name using beautiful soup
Extracting data from xml with similar tag name using beautiful soup

Time:10-08

challenge in extracting data from a XML response of a web api using beautiful soup

need to loop through the entire xml response and fetch data from different tags and store into a data frame

below mentioned are the type of values that need to be extracted form xml and loop through the entire xml and load into data frame .

  • find the Ref value from <Value ref="52f3623a-497c0b0a154b">

  • Org value from <UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias> ( there are two tags with same name UniqueAlias and even not in the same order)

  • Value from the hierarchy tag <Hierarchy><![CDATA[Guide]]></Hierarchy>

  • Value from the additional field tag where attribute label = "country" <AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField>

  • Value from the additional field tag where attribute label = "PrStatus" <AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField>

XML Sample format :


<Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>

<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>

<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>

<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>

</valueList>

Python Code :

text= response.content

soup = BeautifulSoup(text, "html.parser")

data = []


for value in soup.valuelist.find_all('value'):
    additional_fields = [field.text for field in soup.find_all('additionalfield')]

    data.append([
            value['ref'],
            value.uniquealias.text,
            value.hierarchy.text,
            additional_fields[1],
            additional_fields[0],
            additional_fields[2],
    ])

    df = pd.DataFrame(data, columns=['ID', 'Status', 'Name', 'Country', 'org','ST'])

print(df)

Thanks in advance

CodePudding user response:

'pandas' 1.3 has a read_xml() function, that you could try to use. You'll have to play around with the parameters, as the data returned doesnt seem to be what you are after. There may be some nested nodes:

import pandas as pd

text = '''<valueList><Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>'''



df = pd.read_xml(text)

Output:

print(df)
                     ref DisplayName  ...  Hierarchy AdditionalField
0  52f3623a-497c0b0a154b       Guide  ...      Guide         DActive
1  4b0444e0-43137db45c1a     Guide 3  ...    Guide 3          Active
2  4d43bb96-c6b0ad9709ec      Salmon  ...     Salmon         DActive
3  1c0d6493-8f63c9043b5f   Mini comp  ...  Mini comp          Active

OPTION 2:

You can use your code, as it's fine. But would need to change a few things in there:

import pandas as pd
from bs4 import BeautifulSoup

text = '''<Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>'''

soup = BeautifulSoup(text, "html.parser")
data = []

for value in soup.find_all('value'):
    try:
        vId = value['ref']
    except:
        vId = 'N/A'
        print('id not found')
    try:
        status = value.uniquealias.text
    except:
        status = 'N/A'
        print('status not found')
    try:
        name = value.hierarchy.text
    except:
        name = 'N/A'
        print('name not found')
    try:
        country = value.find('additionalfield', {'label':'Country'}).text
    except:
        country = 'N/A'
        print('country not found')
    try:
        org = value.find('additionalfield', {'label':'Organisation'}).text
    except:
        org = 'N/A'
        print('org not found')
    try:
        st = value.find('additionalfield', {'label':'PrStatus'}).text
    except:
        st = 'N/A'
        print('st not found')
        
    row = {
        'ID':vId,
        'Status':status,
        'Name':name,
        'Country':country,
        'org':org,
        'ST':st}
    data.append(row)

df = pd.DataFrame(data)
print(df)

Output:

ID         Status  ...          org       ST
0  52f3623a-497c0b0a154b  STATUS=Active  ...       ABCD/I  DActive
1  4b0444e0-43137db45c1a  ORG=EFG/C|Lim  ...        EFG/C   Active
2  4d43bb96-c6b0ad9709ec  STATUS=Active  ...  LJK/N|Yuly   DActive
3  1c0d6493-8f63c9043b5f  STATUS=Active  ...        xyz/C   Active
  • Related