I am new to python and looking for a solution to the following:
I have a file.xml that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>green cat w short hair</DESCRIPTION_SHORT>
<DESCRIPTION_LONG>green cat w short hair and unlimited zoomies</DESCRIPTION_LONG>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE>
<FNAME>Colour</FNAME>
<FVALUE>green</FVALUE>
</FEATURE>
<FEATURE>
<FNAME>Legs</FNAME>
<FVALUE>14</FVALUE>
</FEATURE>
</PRODUCT_FEATURES>
<PRODUCT_DETAILS>
<DESCRIPTION_SHORT>blue dog w no tail</DESCRIPTION_SHORT>
<DESCRIPTION_LONG>blue dog w no tail and unlimited zoomies</DESCRIPTION_LONG>
</PRODUCT_DETAILS>
<PRODUCT_FEATURES>
<FEATURE>
<FNAME>Colour</FNAME>
<FVALUE>blue</FVALUE>
</FEATURE>
<FEATURE>
<FNAME>Happiness Levels</FNAME>
<FVALUE>11/10</FVALUE>
</FEATURE>
</PRODUCT_FEATURES>
</HEADER>
This is my code:
from lxml import etree as et
import pandas as pd
xml_data = et.parse('file2.xml')
products = xml_data.xpath('//HEADER')
headers=[elem.tag for elem in xml_data.xpath('//HEADER[1]//PRODUCT_DETAILS//*')]
headers.extend(xml_data.xpath('//HEADER[1]//FNAME/text()'))
rows = []
for product in products:
row = [product.xpath(f'.//{headers[0]}/text()')[0],product.xpath(f'.//{headers[1]}/text()')[0]]
f_values = product.xpath('.//FVALUE/text()')
row.extend(f_values)
rows.append(row)
df = pd.DataFrame(rows,columns=headers)
df
# df.to_csv("File2_Export_V1.csv", index=False)
This is the output I want:
DESCRIPTION_SHORT DESCRIPTION_LONG Colour Legs Happiness Levels
0 green cat w short hair green cat w short hair and unlimited zoomies green 14
1 blue dog w no tail blue dog w no tail and unlimited zoomies blue 11/10
My attempt at solving this problem was to extend one line like this:
headers=[elem.tag for elem in xml_data.xpath('//HEADER[1]//PRODUCT_DETAILS//*'),('//HEADER[2]//PRODUCT_DETAILS//*')]
Sadly, I get a syntax error and no solution.
How do I adapt my code to reflect the xml structure?
Thank you in advance! ~C
CodePudding user response:
Probably not the best solution, but I think it's pretty straight-forward, and clear.
import xml.etree.ElementTree as ET
import pandas as pd
# Get xml object
tree = ET.parse('file2.xml')
root = tree.getroot()
# Create final DataFrame
out = pd.DataFrame()
# Loop over all products (Product = (DETAILS, FEATURES))
for i in range(0, len(root), 2):
# Get all descriptions
descriptions = [(child.tag, child.text) for child in root[i]]
# Get all features
features = [(child[0].text, child[1].text) for child in root[i 1]]
# Create a DataFrame, where columns are the tags, and values are, well, values
temp_df = pd.DataFrame([[i[1] for i in descriptions features]], columns=[i[0] for i in descriptions features])
# Append to final DataFrame
out = pd.concat([out, temp_df])