I'm trying to parse a fairly deeply nested XML file and have spent the last few hours looking for a solution with no luck. I'm not sure whether the issue is with namespaces or whether I need to call findall inside the loop.
I can extract the higher-level attributes, but the more deeply nested elements are not being extracted. I want to export part_number, manufacturer_name, name, product and retail to a DataFrame.
XML sample (the submissions are not perfectly uniform; some are missing fields):
<?xml version="1.0" encoding="UTF-8"?><merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd"><header><merchantId>35386</merchantId><merchantName>Rock Bottom Golf</merchantName><createdOn>10/13/2021 14:01:49</createdOn></header>
<product product_id='15' name='Champ Golf- Max Pro Spike Wrench' sku_number='19CHPSPWRCH1111111111101' manufacturer_name='Champ Golf' part_number='19CHPSPWRCH1111111111101'><category><primary>Sporting Goods</primary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&offerid=301124.15&type=15&murl=http://www.rockbottomgolf.com/accessories/other/champ-golf-max-pro-spike-wrench/?utm_source=rakuten&utm_medium=cse&utm_term=19CHPSPWRCH1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19chpspwrch1111111111101.jpg</productImage></URL><description><short>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</short><long>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>9.99</retail></price><brand>Champ Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00036504884013</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&bids=301124.15&type=15&subid=0</pixel><modification>U</modification></product>
<product product_id='21' name='Stinger Tees- 3" Stinger Pro XL Competition Camo Mid Pack Poly Bag [125 Count]' sku_number='19STGTEEMID3CO1111111101' manufacturer_name='Stinger Tees' part_number='19STGTEEMID3CO1111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&offerid=301124.21&type=15&murl=http://www.rockbottomgolf.com/accessories/tees/stinger-tees-3-stinger-pro-xl-competition-camo-mid-pack-poly-bag-125-count/?utm_source=rakuten&utm_medium=cse&utm_term=19STGTEEMID3CO1111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/3 tees 125 count.jpg</productImage></URL><description><short>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</short><long>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>7.99</retail></price><brand>Stinger Tees</brand><shipping><availability>in-stock</availability></shipping><upc>00853190005047</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&bids=301124.21&type=15&subid=0</pixel><modification>U</modification></product>
<product product_id='23' name='Vegas Golf- Original Game' sku_number='19VEGORIGIN1111111111101' manufacturer_name='Vegas Golf' part_number='19VEGORIGIN1111111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&offerid=301124.23&type=15&murl=http://www.rockbottomgolf.com/accessories/other/vegas-golf-original-game/?utm_source=rakuten&utm_medium=cse&utm_term=19VEGORIGIN1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19vegorigin1111111111101.jpg</productImage></URL><description><short>For a limited time only, you'll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</short><long>For a limited time only, you'll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>14.99</retail></price><brand>Vegas Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00689076007030</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&bids=301124.23&type=15&subid=0</pixel><modification>U</modification></product>
<product product_id='28' name='Ray Cook Golf- 12' Compact Cup Ball Retriever' sku_number='19RAYBALRET1111111111201' manufacturer_name='Ray Cook Golf' part_number='19RAYBALRET1111111111201'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&offerid=301124.28&type=15&murl=http://www.rockbottomgolf.com/accessories/ball-retrievers/ray-cook-golf-12-compact-cup-ball-retriever/?utm_source=rakuten&utm_medium=cse&utm_term=19RAYBALRET1111111111201</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19raybalret12.jpg</productImage></URL><description><short>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</short><long>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>19.99</retail></price><brand>Ray Cook Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00840254178410</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&bids=301124.28&type=15&subid=0</pixel><modification>U</modification></product>
I wrote the Python code below, which pulls out part_number, manufacturer_name and name, but the other two fields remain elusive.
My code:
import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse(r"file.xml")
xroot = xtree.getroot()
df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []
for node in xroot:
    part_number = node.attrib.get("part_number")
    manufacturer_name = node.attrib.get("manufacturer_name")
    name = node.attrib.get("name")
    product = node.findall("product") if node is not None else None
    retail = node.findall("retail") if node is not None else None
    rows.append({"part_number": part_number, "manufacturer": manufacturer_name, "name": name, "retail": retail, "product": product,})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df.head()
My current output (retail, product come out as blank):
part_number manufacturer ... retail product
0 None None ... [] []
1 19CHPSPWRCH1111111111101 Champ Golf ... [] []
2 19STGTEEMID3CO1111111101 Stinger Tees ... [] []
3 19VEGORIGIN1111111111101 Vegas Golf ... [] []
4 19RAYBALRET1111111111201 Ray Cook Golf ... [] []
My desired output (URL shortened for readability, but I want the full URL in the product column):
part_number manufacturer ... retail product
0 None None ... 9.99 https://click.linksynergy.com/link?id=83...
1 19CHPSPWRCH1111111111101 Champ Golf ... 7.99 https://click.linksynergy.com/link?id=83...
2 19STGTEEMID3CO1111111101 Stinger Tees ... 14.99 https://click.linksynergy.com/link?id=83...
3 19VEGORIGIN1111111111101 Vegas Golf ... 19.99 https://click.linksynergy.com/link?id=83...
4 19RAYBALRET1111111111201 Ray Cook Golf ... 6.99 https://click.linksynergy.com/link?id=83...
Any help would be most appreciated!
CodePudding user response:
This assumes the XML structure is constant, so that the XPath expression returns the attributes and elements of every product in the same order:
from lxml import etree
import pandas as pd

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []
tree = etree.parse('/home/luis/tmp/tmp.xml')
root = tree.getroot()
steps = tree.xpath('//product/attribute::*[name()="name" or name()="part_number" or name()="manufacturer_name"] | //product/URL/product/text() | //product/price/retail/text()')
# The union is returned in document order, five values per product:
# name, manufacturer_name, part_number, URL/product text, price/retail text
i = 0
d = dict()
for s in steps:
    if i == 0:
        d[df_cols[2]] = s  # name
    if i == 1:
        d[df_cols[1]] = s  # manufacturer
    if i == 2:
        d[df_cols[0]] = s  # part_number
    if i == 3:
        d[df_cols[4]] = s  # product URL
    if i == 4:
        d[df_cols[3]] = s  # retail price
        # Fifth value reached: the row is complete, start a new one
        rows.append(d)
        i = 0
        d = dict()
        continue
    i += 1
out_df = pd.DataFrame(rows, columns = df_cols)
print(out_df.head())
Result:
part_number manufacturer name retail product
0 19CHPSPWRCH1111111111101 Champ Golf Champ Golf- Max Pro Spike Wrench 9.99 https://click.linksynergy.com/link?id=83wh4zNK...
1 19STGTEEMID3CO1111111101 Stinger Tees Stinger Tees- 3" Stinger Pro XL Competition Ca... 7.99 https://click.linksynergy.com/link?id=83wh4zNK...
2 19VEGORIGIN1111111111101 Vegas Golf Vegas Golf- Original Game 14.99 https://click.linksynergy.com/link?id=83wh4zNK...
3 19RAYBALRET1111111111201 Ray Cook Golf Ray Cook Golf- 12' Compact Cup Ball Retriever 19.99 https://click.linksynergy.com/link?id=83wh4zNK...
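Since the question mentions that some products are missing fields, a position-based loop like the one above could shift values into the wrong columns as soon as one value is absent. A minimal sketch of a per-product alternative follows. It swaps lxml for the standard-library ElementTree, and `SAMPLE_XML`, `extract_rows` and the example.com URLs are hypothetical stand-ins for the real file (the second product's price is removed on purpose):

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; the second product deliberately lacks <price>.
SAMPLE_XML = """<merchandiser>
<header><merchantId>35386</merchantId></header>
<product name="Vegas Golf- Original Game" manufacturer_name="Vegas Golf" part_number="19VEGORIGIN1111111111101">
<URL><product>http://example.com/vegas</product></URL>
<price currency="USD"><retail>14.99</retail></price>
</product>
<product name="No Price Item" manufacturer_name="Champ Golf" part_number="X1">
<URL><product>http://example.com/champ</product></URL>
</product>
</merchandiser>"""


def extract_rows(xml_text):
    """Return one dict per <product>, looking each field up by explicit path."""
    root = ET.fromstring(xml_text)
    rows = []
    for p in root.findall("product"):  # direct children of the root only
        rows.append({
            "part_number": p.get("part_number"),
            "manufacturer": p.get("manufacturer_name"),
            "name": p.get("name"),
            # findtext returns None when the path does not match
            "retail": p.findtext("price/retail"),
            "product": p.findtext("URL/product"),
        })
    return rows


rows = extract_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Each row is built independently, so a missing price shows up as None in that row only. The resulting list of dicts can be passed to pd.DataFrame(rows, columns=df_cols) as in the code above.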
CodePudding user response:
See below
import requests
import xml.etree.ElementTree as ET
import pandas as pd

r = requests.get('https://raw.githubusercontent.com/dgs2021/golfdeals/main/35386_3864840_mp_delta.xml')
# XML attribute name -> output column name
attrb_fields = {'manufacturer_name': 'manufacturer','name':'name','part_number':'part_number'}
# output column name -> tag of the nested element to look up
sub_elements = {'retail':'retail','product':'product'}
root = ET.fromstring(r.content)
data = []
# Only the <product> children of the root, so <header> is skipped
for p in root.findall('product'):
    entry = {v:p.attrib.get(k,'NA') for k,v in attrb_fields.items()}
    for k,v in sub_elements.items():
        # './/' searches all descendants, not just direct children
        e = p.find(f'.//{v}')
        entry[k] = e.text if e is not None else 'NA'
    data.append(entry)
columns = list(attrb_fields.values()) + list(sub_elements.values())
df = pd.DataFrame(data,columns= columns)
print(df)
Output:
manufacturer ... product
0 Champ Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
1 Stinger Tees ... https://click.linksynergy.com/link?id=83wh4zNK...
2 Vegas Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
3 Ray Cook Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
4 Rock Bottom Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
... ... ... ...
4100 Callaway Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
4101 Cobra Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
4102 Odyssey Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
4103 TaylorMade Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
4104 Titleist Golf ... https://click.linksynergy.com/link?id=83wh4zNK...
[4105 rows x 5 columns]
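To see why the loop in the question returned empty lists and a leading row of None values, the following small sketch may help. It uses a hypothetical one-product document (`DOC`, with a made-up part number and an example.com URL) that mirrors the shape of the question's XML. In ElementTree, a bare tag name passed to find or findall matches direct children only, while a path starting with `.//` searches all descendants:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document mirroring the shape of the question's XML.
DOC = """<merchandiser>
<header><merchantName>Rock Bottom Golf</merchantName></header>
<product part_number="P1">
<URL><product>http://example.com/p1</product></URL>
<price><retail>9.99</retail></price>
</product>
</merchandiser>"""

root = ET.fromstring(DOC)

# Iterating the root visits every child, including <header>,
# which has no part_number attribute (hence the row of None values).
child_tags = [child.tag for child in root]

item = root.find("product")

# A bare tag name matches direct children only; <retail> sits inside <price>,
# so nothing is found at this level.
direct = item.findall("retail")

# './/' searches all descendants and reaches the nested element.
nested = item.find(".//retail")

print(child_tags)
print(direct)
print(nested.text)
```

This is also why root.findall('product') in the answer skips the header and does not pick up the inner URL/product elements: both steps only look one level down.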