Fastest way to generate XML in Python-CodePudding

I have two tables (csv files pulled from a database), one with orders and the second with items, which have a relation to the orders table. I need to build an XML file from these two files to have this kind of structure (simplified due to readability):

<ORDERS>
    <ORDER>
        <ORDER_ID>11039515178</ORDER_ID>
        <CUSTOMER_ID>394556458</CUSTOMER_ID>
        <ITEMS>
            <ITEM>
                <PRODUCT_ID>1401817</PRODUCT_ID>
                <AMOUNT>2</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>1138857</PRODUCT_ID>
                <AMOUNT>10</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>4707595</PRODUCT_ID>
                <AMOUNT>15</AMOUNT>
            </ITEM>
        </ITEMS>
    </ORDER>
</ORDERS>

I use this code to generate the XML object. It's striped down to the main structure of the code, so it's easily readable:

import xml.etree.ElementTree as ET
import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

# create XML
xml_orrder = ET.Element('ORDERS')
for row in order.itertuples():
    item = ET.SubElement(xml_orrder, 'ORDER')

    o_id = ET.Element('ORDER_ID')
    o_id.text = row.order_id
    item.append(o_id)

    customer = ET.Element('CUSTOMER_ID')
    customer.text = row.customer_id
    item.append(customer)

    order_item_id = order_item[order_item['order_id'] == row.order_id]

    items = ET.SubElement(item, 'ITEMS')
    for order_row in order_item_id.itertuples():
        single_item = ET.SubElement(items, 'ITEM')

        item_id = ET.Element('PRODUCT_ID')
        item_id.text = order_row.product_id
        single_item.append(item_id)

        quantity = ET.Element('AMOUNT')
        quantity.text = order_row.quantity_ordered
        single_item.append(quantity)

My problem here is that it runs unbelievably long (around 15 minutes per 1000 orders and each order having like 20 items). I guess I'm doing something wrong here but I'm not able to find out. Is there a way to speed it up? Use another library? I've tried using itertuples() instead of iterrows(). But this wasn't very helpful.

EDIT:

This is how my data looks like:

order = pd.DataFrame({"order_id": range(1000000,1000010,1),
                         "customer_id": np.random.RandomState(0).randint(1000,2000,10)})

order_item = pd.DataFrame({"order_id": np.random.RandomState(0).randint(1000000,1000010,100),
                         "product_id": np.random.RandomState(0).randint(1000,2000,100),
                         "amount": np.random.RandomState(0).randint(1,100,100)})
order_item.sort_values(by="order_id",inplace=True,ignore_index=True)

CodePudding user response：

I'm not sure what your data looks like, so I hope this will work for you, it took me seconds to process ~5000 rows:

import pandas as pd
import lxml.etree as et

df_order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
df_order_item = pd.read_csv("order_items.csv", encoding='utf8', keep_default_na=False, dtype=str)

new_orders = df_order.merge(df_order_item, 'left', left_on='order_id', right_on='order_id')

orders = et.Element('ORDERS')
for order_id in new_orders['order_id'].unique():
    rows = new_orders[new_orders['order_id'] == order_id]
    customer_id = int(rows['customer_id'].unique())
    order = et.SubElement(orders, 'ORDER')
    o_id = et.SubElement(order, 'ORDER_ID')
    o_id.text = order_id
    c_id = et.SubElement(order, 'CUSTOMER_ID')
    c_id.text = str(customer_id)
    items = et.SubElement(order, 'ITEMS')
    for product in rows.itertuples():
        item = et.SubElement(items, 'ITEM')
        p_id = et.SubElement(item, 'PRODUCT_ID')
        p_id.text = product.product_id
        amount = et.SubElement(item, 'AMOUNT')
        amount.text = product.quantity_ordered

CodePudding user response：

When writing XML or HTML, its frequently faster to write textually rather than adding the expense of building an in-memory XML document. You can write the file directly or use a templating language such as jinja 2. Following is an example using multiline f-strings to write a document with the spacing you want. Since XML doesn't care about newlines or pretty printing, I'd tend to write without the extra spacing.

The code is a little ugly, but that's true for all templating, IMHO.

import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

with open("out.xml", "w") as outfile:
    outfile.write("""\
<ORDERS>
""")
    for row in order.itertuples():
        outfile.write(
f"""\
    <ORDER>
        <ORDER_ID>{row.order_id}</ORDER_ID>
        <CUSTOMER_ID>f{row.customer_id)</CUSOMTER_ID>
""")

        outfile.write(f"""\
        <ITEMS>
""")

        order_item_id = order_item[order_item['order_id'] == row.order_id]
        for order_row in order_item_id.itertuples():
            outfile.write(f"""\
            <ITEM>
                <PRODUCT_ID>{order_row.product_id}</PRODUCT_ID>
                <AMOUNT>{order_row.quantity_ordered}</AMOUNT>
            </ITEM>
"""
        outfile.write("""\
        </ITEMS>
""")
    outfile.write("""\
    </ORDERS>
</ORDER>"""