Parsing deeply nested XML into dataframe with python - struggling with deeper elements

Time:10-16

I'm trying to parse a fairly deeply nested XML file and have spent the last few hours looking for a solution with no luck. I'm not sure whether the issue is with namespaces or whether I need to call findall within the loop.

I can extract the higher-level attributes, but the more deeply nested elements are not being extracted. I want to export part_number, manufacturer_name, name, product and retail to a DataFrame.

XML sample here (the entries are not perfectly uniform; some are missing fields):

<?xml version="1.0" encoding="UTF-8"?><merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd"><header><merchantId>35386</merchantId><merchantName>Rock Bottom Golf</merchantName><createdOn>10/13/2021 14:01:49</createdOn></header>
<product product_id='15' name='Champ Golf- Max Pro Spike Wrench' sku_number='19CHPSPWRCH1111111111101' manufacturer_name='Champ Golf' part_number='19CHPSPWRCH1111111111101'><category><primary>Sporting Goods</primary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.15&amp;type=15&amp;murl=http://www.rockbottomgolf.com/accessories/other/champ-golf-max-pro-spike-wrench/?utm_source=rakuten&utm_medium=cse&utm_term=19CHPSPWRCH1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19chpspwrch1111111111101.jpg</productImage></URL><description><short>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</short><long>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>9.99</retail></price><brand>Champ Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00036504884013</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.15&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='21' name='Stinger Tees- 3&quot; Stinger Pro XL Competition Camo Mid Pack Poly Bag [125 Count]' sku_number='19STGTEEMID3CO1111111101' manufacturer_name='Stinger Tees' part_number='19STGTEEMID3CO1111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.21&amp;type=15&amp;murl=http://www.rockbottomgolf.com/accessories/tees/stinger-tees-3-stinger-pro-xl-competition-camo-mid-pack-poly-bag-125-count/?utm_source=rakuten&utm_medium=cse&utm_term=19STGTEEMID3CO1111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/3 tees 125 count.jpg</productImage></URL><description><short>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</short><long>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>7.99</retail></price><brand>Stinger Tees</brand><shipping><availability>in-stock</availability></shipping><upc>00853190005047</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.21&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='23' name='Vegas Golf- Original Game' sku_number='19VEGORIGIN1111111111101' manufacturer_name='Vegas Golf' part_number='19VEGORIGIN1111111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.23&amp;type=15&amp;murl=http://www.rockbottomgolf.com/accessories/other/vegas-golf-original-game/?utm_source=rakuten&utm_medium=cse&utm_term=19VEGORIGIN1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19vegorigin1111111111101.jpg</productImage></URL><description><short>For a limited time only, you&apos;ll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</short><long>For a limited time only, you&apos;ll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>14.99</retail></price><brand>Vegas Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00689076007030</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.23&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='28' name='Ray Cook Golf- 12&apos; Compact Cup Ball Retriever' sku_number='19RAYBALRET1111111111201' manufacturer_name='Ray Cook Golf' part_number='19RAYBALRET1111111111201'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.28&amp;type=15&amp;murl=http://www.rockbottomgolf.com/accessories/ball-retrievers/ray-cook-golf-12-compact-cup-ball-retriever/?utm_source=rakuten&utm_medium=cse&utm_term=19RAYBALRET1111111111201</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19raybalret12.jpg</productImage></URL><description><short>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</short><long>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>19.99</retail></price><brand>Ray Cook Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00840254178410</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.28&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>

I have written the Python code below, which pulls out part_number, manufacturer_name and name, while the other two remain elusive.

My code:

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse(r"file.xml")
xroot = xtree.getroot() 

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []

for node in xroot: 
    part_number = node.attrib.get("part_number")
    manufacturer_name = node.attrib.get("manufacturer_name")
    name = node.attrib.get("name")  
    product = node.findall("product") if node is not None else None
    retail = node.findall("retail") if node is not None else None

    rows.append({"part_number": part_number, "manufacturer": manufacturer_name, "name": name, "retail": retail, "product": product,})


out_df = pd.DataFrame(rows, columns = df_cols)

out_df.head()

My current output (retail, product come out as blank):

                part_number   manufacturer  ... retail product
0                      None           None  ...     []      []
1  19CHPSPWRCH1111111111101     Champ Golf  ...     []      []
2  19STGTEEMID3CO1111111101   Stinger Tees  ...     []      []
3  19VEGORIGIN1111111111101     Vegas Golf  ...     []      []
4  19RAYBALRET1111111111201  Ray Cook Golf  ...     []      []

My desired output (URL shortened for readability; the product column should hold the full URL):

                part_number   manufacturer  ... retail product
0                      None           None  ...     9.99     https://click.linksynergy.com/link?id=83...
1  19CHPSPWRCH1111111111101     Champ Golf  ...     7.99      https://click.linksynergy.com/link?id=83...
2  19STGTEEMID3CO1111111101   Stinger Tees  ...     14.99      https://click.linksynergy.com/link?id=83...
3  19VEGORIGIN1111111111101     Vegas Golf  ...     19.99      https://click.linksynergy.com/link?id=83...
4  19RAYBALRET1111111111201  Ray Cook Golf  ...     6.99      https://click.linksynergy.com/link?id=83...

Any help would be most appreciated!

CodePudding user response:

Assuming the XML structure is constant and the elements/attributes are returned by the XPath expression in the same order for every product:

from lxml import etree
import pandas as pd

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []
tree = etree.parse('/home/luis/tmp/tmp.xml')
root = tree.getroot()
steps = tree.xpath('//product/attribute::*[name()="name" or name()="part_number" or name()="manufacturer_name"] | //product/URL/product/text() | //product/price/retail/text()')
i=0
d=dict()
for s in steps:
    # XPath returns nodes in document order, so each product yields:
    # name, manufacturer_name, part_number, URL/product text, price/retail text
    if i == 0:
        d[df_cols[2]] = s  # name
    elif i == 1:
        d[df_cols[1]] = s  # manufacturer_name
    elif i == 2:
        d[df_cols[0]] = s  # part_number
    elif i == 3:
        d[df_cols[4]] = s  # product URL
    elif i == 4:
        d[df_cols[3]] = s  # retail price
        rows.append(d)
        i = 0
        d = dict()
        continue
    i += 1


out_df = pd.DataFrame(rows, columns = df_cols)

print(out_df.head())

Result:

                part_number   manufacturer                                               name  retail                                            product
0  19CHPSPWRCH1111111111101     Champ Golf                   Champ Golf- Max Pro Spike Wrench    9.99  https://click.linksynergy.com/link?id=83wh4zNK...
1  19STGTEEMID3CO1111111101   Stinger Tees  Stinger Tees- 3" Stinger Pro XL Competition Ca...    7.99  https://click.linksynergy.com/link?id=83wh4zNK...
2  19VEGORIGIN1111111111101     Vegas Golf                          Vegas Golf- Original Game   14.99  https://click.linksynergy.com/link?id=83wh4zNK...
3  19RAYBALRET1111111111201  Ray Cook Golf      Ray Cook Golf- 12' Compact Cup Ball Retriever   19.99  https://click.linksynergy.com/link?id=83wh4zNK...
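
Since the question mentions that some entries are missing fields, a positional counter like the one above can drift out of step as soon as one product lacks a value. A minimal sketch of a per-product lookup that avoids this is shown here. It swaps lxml for the standard-library ElementTree to stay dependency-free, and the SAMPLE string and the product_rows helper are made up for illustration (the second product deliberately has no price element):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same shape as the feed; product "B" has no <price>.
SAMPLE = """<merchandiser>
  <product name="A" manufacturer_name="Acme" part_number="P1">
    <URL><product>http://example.com/a</product></URL>
    <price><retail>9.99</retail></price>
  </product>
  <product name="B" manufacturer_name="Bolt" part_number="P2">
    <URL><product>http://example.com/b</product></URL>
  </product>
  <product name="C" manufacturer_name="Cleat" part_number="P3">
    <URL><product>http://example.com/c</product></URL>
    <price><retail>14.99</retail></price>
  </product>
</merchandiser>"""


def product_rows(root):
    """Build one dict per top-level <product>, looking each field up by path."""
    rows = []
    for p in root.findall("product"):  # direct children of the root only
        rows.append({
            "part_number": p.get("part_number"),
            "manufacturer": p.get("manufacturer_name"),
            "name": p.get("name"),
            # findtext returns None when the path does not exist
            "retail": p.findtext("price/retail"),
            "product": p.findtext("URL/product"),
        })
    return rows


rows = product_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Each row is built from its own product element, so a missing price only leaves that one cell empty. The list can then be handed to pd.DataFrame(rows, columns = df_cols) as before.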

CodePudding user response:

See below

import requests
import xml.etree.ElementTree as ET
import pandas as pd

r = requests.get('https://raw.githubusercontent.com/dgs2021/golfdeals/main/35386_3864840_mp_delta.xml')
attrb_fields =  {'manufacturer_name': 'manufacturer','name':'name','part_number':'part_number'}
sub_elements = {'retail':'retail','product':'product'}

root = ET.fromstring(r.content)

data = []
for p in root.findall('product'):
  entry = {v:p.attrib.get(k,'NA') for k,v in attrb_fields.items()}
  for k,v in sub_elements.items():
    e = p.find(f'.//{v}')
    entry[k] = e.text if e is not None else 'NA'
  data.append(entry)
columns = list(attrb_fields.values()) + list(sub_elements.values())
df = pd.DataFrame(data,columns= columns)
print(df)

output

          manufacturer  ...                                            product
0           Champ Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
1         Stinger Tees  ...  https://click.linksynergy.com/link?id=83wh4zNK...
2           Vegas Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
3        Ray Cook Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4     Rock Bottom Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
...                ...  ...                                                ...
4100     Callaway Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4101        Cobra Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4102      Odyssey Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4103   TaylorMade Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4104     Titleist Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...

[4105 rows x 5 columns]
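
As for why the code in the question returned empty lists: in ElementTree, a bare tag name passed to find/findall matches direct children only, and retail sits under price while the URL sits under URL/product. Looping over the root also visits the header element, which probably explains the all-None first row. A small sketch of the difference, using a made-up, trimmed-down SAMPLE rather than the real feed:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same nesting as the feed in the question.
SAMPLE = """<merchandiser>
  <header><merchantName>Example Shop</merchantName></header>
  <product name="Spike Wrench" manufacturer_name="Champ Golf" part_number="P1">
    <URL><product>http://example.com/p1</product></URL>
    <price currency="USD"><retail>9.99</retail></price>
  </product>
</merchandiser>"""

root = ET.fromstring(SAMPLE)

# Iterating the root yields every direct child, including <header>,
# which carries none of the product attributes.
child_tags = [child.tag for child in root]

item = root.find("product")

# A bare tag name matches direct children only, so this list is empty.
direct = item.findall("retail")

# An explicit path or a descendant search reaches the nested elements.
by_path = item.findtext("price/retail")
by_descendant = item.findtext(".//retail")
url = item.findtext("URL/product")

print(child_tags)
print(direct)
print(by_path, by_descendant, url)
```

Using root.findall('product') instead of iterating the root directly skips the header, and .// (or a full path) is what reaches the nested values.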