I have two tables (csv files pulled from a database), one with orders and the second with items, which have a relation to the orders table. I need to build an XML file from these two files to have this kind of structure (simplified due to readability):
<ORDERS>
<ORDER>
<ORDER_ID>11039515178</ORDER_ID>
<CUSTOMER_ID>394556458</CUSTOMER_ID>
<ITEMS>
<ITEM>
<PRODUCT_ID>1401817</PRODUCT_ID>
<AMOUNT>2</AMOUNT>
</ITEM>
<ITEM>
<PRODUCT_ID>1138857</PRODUCT_ID>
<AMOUNT>10</AMOUNT>
</ITEM>
<ITEM>
<PRODUCT_ID>4707595</PRODUCT_ID>
<AMOUNT>15</AMOUNT>
</ITEM>
</ITEMS>
</ORDER>
</ORDERS>
I use this code to generate the XML object. It's striped down to the main structure of the code, so it's easily readable:
import xml.etree.ElementTree as ET
import pandas as pd
order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)
# create XML
xml_orrder = ET.Element('ORDERS')
for row in order.itertuples():
item = ET.SubElement(xml_orrder, 'ORDER')
o_id = ET.Element('ORDER_ID')
o_id.text = row.order_id
item.append(o_id)
customer = ET.Element('CUSTOMER_ID')
customer.text = row.customer_id
item.append(customer)
order_item_id = order_item[order_item['order_id'] == row.order_id]
items = ET.SubElement(item, 'ITEMS')
for order_row in order_item_id.itertuples():
single_item = ET.SubElement(items, 'ITEM')
item_id = ET.Element('PRODUCT_ID')
item_id.text = order_row.product_id
single_item.append(item_id)
quantity = ET.Element('AMOUNT')
quantity.text = order_row.quantity_ordered
single_item.append(quantity)
My problem here is that it runs unbelievably long (around 15 minutes per 1000 orders and each order having like 20 items). I guess I'm doing something wrong here but I'm not able to find out. Is there a way to speed it up? Use another library? I've tried using itertuples() instead of iterrows(). But this wasn't very helpful.
EDIT:
This is how my data looks like:
order = pd.DataFrame({"order_id": range(1000000,1000010,1),
"customer_id": np.random.RandomState(0).randint(1000,2000,10)})
order_item = pd.DataFrame({"order_id": np.random.RandomState(0).randint(1000000,1000010,100),
"product_id": np.random.RandomState(0).randint(1000,2000,100),
"amount": np.random.RandomState(0).randint(1,100,100)})
order_item.sort_values(by="order_id",inplace=True,ignore_index=True)
CodePudding user response:
I'm not sure what your data looks like, so I hope this will work for you, it took me seconds to process ~5000 rows:
import pandas as pd
import lxml.etree as et
df_order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
df_order_item = pd.read_csv("order_items.csv", encoding='utf8', keep_default_na=False, dtype=str)
new_orders = df_order.merge(df_order_item, 'left', left_on='order_id', right_on='order_id')
orders = et.Element('ORDERS')
for order_id in new_orders['order_id'].unique():
rows = new_orders[new_orders['order_id'] == order_id]
customer_id = int(rows['customer_id'].unique())
order = et.SubElement(orders, 'ORDER')
o_id = et.SubElement(order, 'ORDER_ID')
o_id.text = order_id
c_id = et.SubElement(order, 'CUSTOMER_ID')
c_id.text = str(customer_id)
items = et.SubElement(order, 'ITEMS')
for product in rows.itertuples():
item = et.SubElement(items, 'ITEM')
p_id = et.SubElement(item, 'PRODUCT_ID')
p_id.text = product.product_id
amount = et.SubElement(item, 'AMOUNT')
amount.text = product.quantity_ordered
CodePudding user response:
When writing XML or HTML, its frequently faster to write textually rather than adding the expense of building an in-memory XML document. You can write the file directly or use a templating language such as jinja 2. Following is an example using multiline f-strings to write a document with the spacing you want. Since XML doesn't care about newlines or pretty printing, I'd tend to write without the extra spacing.
The code is a little ugly, but that's true for all templating, IMHO.
import pandas as pd
order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)
with open("out.xml", "w") as outfile:
outfile.write("""\
<ORDERS>
""")
for row in order.itertuples():
outfile.write(
f"""\
<ORDER>
<ORDER_ID>{row.order_id}</ORDER_ID>
<CUSTOMER_ID>f{row.customer_id)</CUSOMTER_ID>
""")
outfile.write(f"""\
<ITEMS>
""")
order_item_id = order_item[order_item['order_id'] == row.order_id]
for order_row in order_item_id.itertuples():
outfile.write(f"""\
<ITEM>
<PRODUCT_ID>{order_row.product_id}</PRODUCT_ID>
<AMOUNT>{order_row.quantity_ordered}</AMOUNT>
</ITEM>
"""
outfile.write("""\
</ITEMS>
""")
outfile.write("""\
</ORDERS>
</ORDER>"""