How to pull a value out of an element in a nested XML document in Python?

I'm querying an API to look up part numbers that a user enters with a barcode scanner. The API returns a much longer document than the one below; I trimmed a number of unnecessary empty elements, but the structure is unchanged. On each run, my program generates a list of part numbers and loops over it, querying the API for each one, and each query returns a large XML document as expected.

I need to build a dictionary in which each part number is a key and the value is the text inside the <mfgr> element. I'm stuck on parsing the XML to get only the text of the <mfgr> element and saving it to a dictionary under the part number it belongs to. My loop over the list is shown below the XML document.

<ArrayOfitem xmlns="WhereDataComesFrom.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
    <item>
        <associateditem_swap/>
        <bulk>false</bulk>
        <category>Memory</category>
        <clei>false</clei>
        <createddate>5/11/2021 7:34:58 PM</createddate>
        <description>sample description</description>
        <heci/>
        <imageurl/>
        <item_swap/>
        <itemid>1640</itemid>
        <itemnumber>**sample part number**</itemnumber>
        <listprice>0.0000</listprice>
        <manufactureritem/>
        <maxavailable>66</maxavailable>
        <mfgr>**sample manufacturer**</mfgr>
        <minreorderqty>0</minreorderqty>
        <noninventory>false</noninventory>
        <primarylocation/>
        <reorderpoint>0</reorderpoint>
        <rep>AP</rep>
        <type>Memory                                  </type>
        <updateddate>2/4/2022 2:22:51 PM</updateddate>
        <warehouse>MAIN</warehouse>
    </item>
</ArrayOfitem>

Below is my Python code that loops through the part number list and asks the API to look up each part number.

import http.client
import xml.etree.ElementTree as etree

raw_xml = None
pn_list = ["samplepart1", "samplepart2"]
api_key = "**redacted**"  # real key removed

def getMFGR():
    global raw_xml
    for part_number in pn_list:
        conn = http.client.HTTPSConnection("api.website.com")
        payload = ''
        headers = {
            'session-token': api_key,  # pass the variable, not the literal string 'api_key'
            'Cookie': 'firstpartofmycookie; secondpartofmycookie'
        }
        # Append the part number to the query string
        conn.request("GET", "/webapi.svc/MI/XML/GetItemsByItemNumber?ItemNumber=" + part_number, payload, headers)
        res = conn.getresponse()
        data = res.read()
        raw_xml = data.decode("utf-8")
        print(raw_xml)
        print()

getMFGR()

Here is some code I tried in order to get the manufacturer. It will go inside the for loop in the getMFGR() function so that it saves the manufacturer to a variable on each iteration. Once the code works, I want the dictionary to look like this: {"samplepart1": "manufacturer1", "samplepart2": "manufacturer2"}.

root = etree.fromstring(raw_xml)
my_ns = {'root': 'WhereDataComesFrom.com'}

mfgr = root.findall('root:mfgr',my_ns)[0].text

The code above gives me a list index out of range error when I run it. I don't think it's searching past the namespaces node but I'm not sure how to tell it to search further.

CodePudding user response：

This is where an interactive session becomes very useful. Drop your XML data into a file (say, data.xml), and then start up a Python REPL:

>>> import xml.etree.ElementTree as etree
>>> with open('data.xml') as fd:
...     raw_xml=fd.read()
...
>>> root = etree.fromstring(raw_xml)
>>> my_ns = {'root': 'WhereDataComesFrom.com'}

Let's first look at your existing XPath expression:

>>> root.findall('root:mfgr',my_ns)
[]

That returns an empty list, which is why you're getting an "index out of range" error. You're getting an empty list because there is no mfgr element at the top level of the document; it's contained in an <item> element. So this will work:

>>> root.findall('root:item/root:mfgr',my_ns)
[<Element '{WhereDataComesFrom.com}mfgr' at 0x7fa5a45e2b60>]

To actually get the contents of that element:

>>> [x.text for x in root.findall('root:item/root:mfgr',my_ns)]
['**sample manufacturer**']
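
If the depth of the element is not known in advance, a descendant search is an alternative to spelling out the full path. The following is only a sketch: it assumes the same namespace URI as above, the trimmed `xml_text` sample is made up for illustration, and the `{*}` wildcard needs Python 3.8 or later.

```python
import xml.etree.ElementTree as etree

# Trimmed, hypothetical stand-in for the API response.
xml_text = """<ArrayOfitem xmlns="WhereDataComesFrom.com">
    <item>
        <itemnumber>samplepart1</itemnumber>
        <mfgr>manufacturer1</mfgr>
    </item>
</ArrayOfitem>"""

root = etree.fromstring(xml_text)
my_ns = {'root': 'WhereDataComesFrom.com'}

# './/' searches all descendants, so the <item> level need not be named.
by_descendant = root.find('.//root:mfgr', my_ns)

# '{*}' matches any namespace (Python 3.8+), so no prefix mapping is needed.
by_wildcard = root.find('.//{*}mfgr')

# findtext() returns the text directly, or the default if nothing matches.
mfgr_text = root.findtext('.//root:mfgr', default=None, namespaces=my_ns)
missing = root.findtext('.//root:nosuchtag', default='unknown', namespaces=my_ns)

print(by_descendant.text, by_wildcard.text, mfgr_text, missing)
```

Unlike indexing into the result of findall(), findtext() with a default does not raise when the element is missing, which may be convenient if the API can return a document with no matching item.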

Hopefully that's enough to point you in the right direction.
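
Combined with the loop from the question, one way to build the dictionary might look like the sketch below. The `extract_mfgr` helper, the `TEMPLATE` string, and the `fake_responses` data are hypothetical stand-ins so the example runs without network access; in the real program each XML string would be the decoded response body from the API.

```python
import xml.etree.ElementTree as etree

NS = {'root': 'WhereDataComesFrom.com'}

def extract_mfgr(raw_xml):
    """Return the text of the first <mfgr> under an <item>, or None if absent."""
    root = etree.fromstring(raw_xml)
    return root.findtext('root:item/root:mfgr', default=None, namespaces=NS)

# Hypothetical, trimmed response body used in place of the real API call.
TEMPLATE = """<ArrayOfitem xmlns="WhereDataComesFrom.com">
    <item>
        <itemnumber>{part}</itemnumber>
        <mfgr>{mfgr}</mfgr>
    </item>
</ArrayOfitem>"""

fake_responses = {
    'samplepart1': TEMPLATE.format(part='samplepart1', mfgr='manufacturer1'),
    'samplepart2': TEMPLATE.format(part='samplepart2', mfgr='manufacturer2'),
}

# One dictionary entry per part number, as requested in the question.
mfgr_by_part = {}
for part_number, raw_xml in fake_responses.items():
    mfgr_by_part[part_number] = extract_mfgr(raw_xml)

print(mfgr_by_part)
# {'samplepart1': 'manufacturer1', 'samplepart2': 'manufacturer2'}
```

Returning the value from a helper rather than writing to a global keeps each lookup independent, and a part number with no matching item ends up mapped to None instead of raising an IndexError.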

CodePudding user response：

I suggest using pandas for this XML structure:

import pandas as pd

# Read each <item> row of the XML into a DataFrame
ns = {"xmlns":"WhereDataComesFrom.com", "xmlns:i":"http://www.w3.org/2001/XMLSchema-instance"}
df = pd.read_xml("parNo_plant.xml", xpath=".//xmlns:item", namespaces=ns)

# Print only the columns of interest
df_of_interest = df[['itemnumber', 'mfgr']]
print(df_of_interest,'\n')

# Print the DataFrame as a list of record dictionaries
print(df_of_interest.to_dict(orient='records'))

# If I understood correctly, this is the layout you are looking for:
dictionary = dict(zip(df.itemnumber, df.mfgr))
print(dictionary)

Result (the DataFrame, the list of records, and the dictionary):

               itemnumber                     mfgr
0  **sample part number**  **sample manufacturer** 

[{'itemnumber': '**sample part number**', 'mfgr': '**sample manufacturer**'}]

{'**sample part number**': '**sample manufacturer**'}