I have an XML URL that I am trying to read in Python. The XML contains a large dataset of around 50-60K products.
Example of two products in the XML:
<?xml version='1.0' encoding='utf-8'?>
<channel>
<title>
Google Shopping NL
</title>
<description>
Google Shopping NL
</description>
<pubDate>
Tue, 10 Jan 2023 09:30:35 -0000
</pubDate>
<item>
<ecomm_prodid>123456</ecomm_prodid>
<g:gtin>8714567834276</g:gtin>
<g:price>17.95 EUR</g:price>
<title>Unique Living Teddy plaid - Bruin - 200x150cm</title>
</item>
<item>
<ecomm_prodid>56789</ecomm_prodid>
<g:gtin>871987731105</g:gtin>
<g:price>29.90 EUR</g:price>
<title>Tristar OV-1431 oven 35x25 - 800W - 230V</title>
</item>
I want to read the XML and loop through each 'item' to check whether it contains a certain 'ecomm_prodid', so that I can retrieve the 'g:gtin' of that product. Is that the best way, and if so, how can I achieve it?
Kind regards:)
CodePudding user response:
Use ElementTree to parse your XML and iterate over its items, then call find on each item to get the corresponding sub-tags.
Example:
import xml.etree.ElementTree as ET
xmldata = """<?xml version='1.0' encoding='utf-8'?>
<channel xmlns:g="base.google.com/ns/1.0">
<title>
Google Shopping NL
</title>
<description>
Google Shopping NL
</description>
<pubDate>
Tue, 10 Jan 2023 09:30:35 -0000
</pubDate>
<item>
<ecomm_prodid>123456</ecomm_prodid>
<g:gtin>8714567834276</g:gtin>
<g:price>17.95 EUR</g:price>
<title>Unique Living Teddy plaid - Bruin - 200x150cm</title>
</item>
<item>
<ecomm_prodid>56789</ecomm_prodid>
<g:gtin>871987731105</g:gtin>
<g:price>29.90 EUR</g:price>
<title>Tristar OV-1431 oven 35x25 - 800W - 230V</title>
</item>
</channel>
"""
xml = ET.fromstring(xmldata)  # for a file, use ET.parse(filename).getroot()
for item in xml.findall('item'):
    prodid = item.find('ecomm_prodid').text
    # the prefix "g" must be mapped to the namespace URI declared on <channel>
    gtin = item.find('g:gtin', {"g": "base.google.com/ns/1.0"}).text
    # Now you can access prodid and gtin
    print(f"{prodid} - {gtin}")
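If you need to look up more than one product, it may be easier to build a dictionary once and query it afterwards. The following is only a sketch of that idea: the helper name build_gtin_index is made up for illustration, and the namespace URI is the one from the example above, so a real feed may declare a different one.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the example above (assumption: your feed may differ).
NS = {"g": "base.google.com/ns/1.0"}


def build_gtin_index(root):
    """Map each ecomm_prodid to its g:gtin, skipping items that lack either tag."""
    index = {}
    for item in root.iter("item"):
        prodid = item.findtext("ecomm_prodid")
        gtin = item.findtext("g:gtin", namespaces=NS)
        if prodid is not None and gtin is not None:
            index[prodid.strip()] = gtin.strip()
    return index


sample = """<channel xmlns:g="base.google.com/ns/1.0">
<item>
<ecomm_prodid>123456</ecomm_prodid>
<g:gtin>8714567834276</g:gtin>
</item>
<item>
<ecomm_prodid>56789</ecomm_prodid>
<g:gtin>871987731105</g:gtin>
</item>
</channel>"""

index = build_gtin_index(ET.fromstring(sample))
print(index.get("56789"))   # 871987731105
print(index.get("000000"))  # None, no product with that id
```

Keeping the ids and GTINs as strings avoids losing leading zeros, and index.get() returns None instead of raising when a product id is missing.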
CodePudding user response:
With pandas you can parse the items with read_xml():
import pandas as pd
ns = {"g": "http://base.google.com/ns/1.0"}  # map the prefix "g" to its namespace URI
df = pd.read_xml("google.xml", xpath=".//item", namespaces=ns)
print(df[['ecomm_prodid', 'gtin']])
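To get the GTIN for one specific product id, you can filter the resulting DataFrame. This is a rough sketch rather than a tested recipe for your feed: it reads a small inline sample instead of google.xml, the variable wanted is a hypothetical id, and parser="etree" is chosen only so that lxml is not required.

```python
import io

import pandas as pd

# Inline stand-in for google.xml (assumption: the real feed has the same item layout).
xmldata = b"""<?xml version='1.0' encoding='utf-8'?>
<channel xmlns:g="http://base.google.com/ns/1.0">
<item>
<ecomm_prodid>123456</ecomm_prodid>
<g:gtin>8714567834276</g:gtin>
</item>
<item>
<ecomm_prodid>56789</ecomm_prodid>
<g:gtin>871987731105</g:gtin>
</item>
</channel>
"""

ns = {"g": "http://base.google.com/ns/1.0"}
# parser="etree" uses the standard library parser instead of lxml
df = pd.read_xml(io.BytesIO(xmldata), xpath=".//item", namespaces=ns, parser="etree")

wanted = "56789"  # hypothetical product id to look up
# Compare as strings so the match works whichever dtype pandas inferred
matches = df.loc[df["ecomm_prodid"].astype(str) == wanted, "gtin"]
gtin = None if matches.empty else str(matches.iloc[0])
print(gtin)
```

One caveat: pandas will probably infer numeric columns here, so a GTIN that starts with a zero could lose it. If that matters for your data, the ElementTree approach keeps the values as text.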