How to parse an XML URL in Python and loop through each item?

Time:01-14

I have an XML URL that I am trying to read in Python. The XML contains a large dataset of around 50-60K products.

Example of two products in the XML:

<?xml version='1.0' encoding='utf-8'?>
<channel>
    <title>
        Google Shopping NL
    </title>
    <description>
        Google Shopping NL
    </description>
    <pubDate>
        Tue, 10 Jan 2023 09:30:35 -0000
    </pubDate>
    <item>
        <ecomm_prodid>123456</ecomm_prodid>
        <g:gtin>8714567834276</g:gtin>
        <g:price>17.95 EUR</g:price>
        <title>Unique Living Teddy plaid - Bruin - 200x150cm</title>
    </item>
    <item>
        <ecomm_prodid>56789</ecomm_prodid>
        <g:gtin>871987731105</g:gtin>
        <g:price>29.90 EUR</g:price>
        <title>Tristar OV-1431 oven 35x25 - 800W - 230V</title>
    </item>

I want to read the XML and loop through each 'item' to check whether it contains a certain 'ecomm_prodid', so that I can retrieve the 'g:gtin' of that product. Is that the best way, and if so, how can I achieve it?

Kind regards:)

CodePudding user response:

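Use ElementTree to parse your XML and iterate over its items, then call find on each item to get the corresponding sub-tags. Because g:gtin is namespaced, find needs a prefix-to-URI mapping that matches the xmlns:g declaration in the document.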

Example:

import xml.etree.ElementTree as ET

xmldata = """<?xml version='1.0' encoding='utf-8'?>
<channel xmlns:g="base.google.com/ns/1.0">
    <title>
        Google Shopping NL
    </title>
    <description>
        Google Shopping NL
    </description>
    <pubDate>
        Tue, 10 Jan 2023 09:30:35 -0000
    </pubDate>
    <item>
        <ecomm_prodid>123456</ecomm_prodid>
        <g:gtin>8714567834276</g:gtin>
        <g:price>17.95 EUR</g:price>
        <title>Unique Living Teddy plaid - Bruin - 200x150cm</title>
    </item>
    <item>
        <ecomm_prodid>56789</ecomm_prodid>
        <g:gtin>871987731105</g:gtin>
        <g:price>29.90 EUR</g:price>
        <title>Tristar OV-1431 oven 35x25 - 800W - 230V</title>
    </item>
</channel>
"""

root = ET.fromstring(xmldata)  # for a file, use ET.parse(filename).getroot()
ns = {"g": "base.google.com/ns/1.0"}  # must match the xmlns:g declaration above
for item in root.findall('item'):
    # findtext returns None instead of raising if the tag is missing
    prodid = item.findtext('ecomm_prodid')
    gtin = item.findtext('g:gtin', namespaces=ns)
    # Now you can access prodid and gtin
    print(f"{prodid} - {gtin}")
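
Since the question is about a URL and a feed of 50-60K items, here is a minimal sketch of a streaming lookup with ET.iterparse, which avoids building the whole tree in memory. The helper name find_gtin, the FEED_URL placeholder and the namespace URI are assumptions for illustration; use whatever URI the real feed declares for xmlns:g.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "g" prefix; it must match the feed's
# xmlns:g declaration exactly.
G_NS = "http://base.google.com/ns/1.0"


def find_gtin(source, wanted_prodid):
    """Return the g:gtin of the first <item> whose ecomm_prodid equals
    wanted_prodid, or None if no item matches.

    source is a filename or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        prodid = elem.findtext("ecomm_prodid")
        if prodid is not None and prodid.strip() == wanted_prodid:
            gtin = elem.findtext(f"{{{G_NS}}}gtin")
            return gtin.strip() if gtin else None
        # Drop the children of items we are done with; the emptied
        # element itself stays attached to the root.
        elem.clear()
    return None


# Small in-memory sample so the sketch runs without network access.
sample = b"""<channel xmlns:g="http://base.google.com/ns/1.0">
    <title>Google Shopping NL</title>
    <item>
        <ecomm_prodid>123456</ecomm_prodid>
        <g:gtin>8714567834276</g:gtin>
    </item>
    <item>
        <ecomm_prodid>56789</ecomm_prodid>
        <g:gtin>871987731105</g:gtin>
    </item>
</channel>"""

print(find_gtin(io.BytesIO(sample), "56789"))

# With a remote feed (FEED_URL is a placeholder, not a real address):
# from urllib.request import urlopen
# with urlopen(FEED_URL) as response:
#     print(find_gtin(response, "56789"))
```

If you need to look up many product IDs, it is probably cheaper to loop once and fill a dict mapping ecomm_prodid to gtin than to scan the feed for each ID.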

CodePudding user response:

With pandas you can parse the items with read_xml():

import pandas as pd

ns = {"g": "http://base.google.com/ns/1.0"}  # keys are prefixes, values are namespace URIs
df = pd.read_xml("google.xml", xpath=".//item", namespaces=ns)
print(df[['ecomm_prodid', 'gtin']])
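
To get from the DataFrame to the gtin of one specific product, a boolean filter is enough. This is only a sketch under a few assumptions: the XML is an in-memory sample, parser="etree" is used so lxml is not required, and dtype=str (available in recent pandas versions) keeps the IDs as text so that a GTIN with leading zeros is not turned into a number.

```python
import io

import pandas as pd

# In-memory sample standing in for the real feed.
xml_text = """<channel xmlns:g="http://base.google.com/ns/1.0">
    <item>
        <ecomm_prodid>123456</ecomm_prodid>
        <g:gtin>8714567834276</g:gtin>
    </item>
    <item>
        <ecomm_prodid>56789</ecomm_prodid>
        <g:gtin>871987731105</g:gtin>
    </item>
</channel>"""

# One row per <item>; the g: prefix is stripped from column names.
df = pd.read_xml(
    io.StringIO(xml_text),
    xpath=".//item",
    parser="etree",
    dtype={"ecomm_prodid": str, "gtin": str},
)

# Select the gtin of the row whose ecomm_prodid matches.
matches = df.loc[df["ecomm_prodid"] == "56789", "gtin"]
gtin = matches.iloc[0] if not matches.empty else None
print(gtin)
```

read_xml also accepts a URL in place of the file name, so the download step can be left to pandas.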