Using Pandas DataFrame.to_xml without row or root element

I have an XSD file that looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<xs:schema version="1.0"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">   
    <xs:element name="TEST">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Content1" type="xs:integer"/> 
                <xs:element name="Content2" type="xs:string" />
                <xs:element name="Content3" type="xs:string"/>
                <xs:element name="Content4" type="xs:string" />
                <xs:element name="Content5" type="xs:string" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

I want to export each row of my data frame into a separate XML file using df.to_xml(). I am using:

for row in range(df.shape[0]):
    # Slice out a one-row data frame for the current row
    df1 = df.iloc[row:row + 1]
    df1.to_xml(f"{base_path}/{filename}", root_name="TEST", index=False)

Currently the output looks like this:

<?xml version='1.0' encoding='utf-8'?>
<TEST>
  <row>
    <Content1>123</Content1>
    <Content2>abc</Content2>
    <Content3>242136</Content3>
    <Content4>90°</Content4>
  </row>
</TEST>

My problem is the <row> and </row> lines. How can I prevent them from being created? Alternatively, I could name the row element TEST and suppress the root element, if that is possible.

DataFrame.to_xml always creates both a root and a row element, but I need only one of them. How can I get output that contains only one?

CodePudding user response:

Another option is an XSLT stylesheet, passed through the stylesheet argument of to_xml (this requires the lxml parser):

xslt = '''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" method="xml" />
    
    <xsl:template match="row">
        <xsl:apply-templates select="*"/>
    </xsl:template>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
'''

for row in range(df.shape[0]):
    # Slice out a one-row data frame for the current row
    df1 = df.iloc[row:row + 1]
    df1.to_xml(f"{base_path}/{filename}", root_name="TEST", index=False, stylesheet=xslt)

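The stylesheet is applied to the XML that to_xml generates. The second template copies every node and attribute unchanged, while the template matching "row" emits only the children of row, so the <row> wrapper is dropped and its contents end up directly under TEST.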
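To make the effect of the "row" template concrete, here is a minimal sketch that performs the same unwrapping on the sample output from the question. It swaps XSLT for the standard library's xml.etree.ElementTree, so it is an illustration of the result rather than what to_xml does internally. The helper name unwrap_rows is made up for this example, and ET.indent assumes Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

# Sample output copied from the question (XML declaration left out).
SAMPLE = """<TEST>
  <row>
    <Content1>123</Content1>
    <Content2>abc</Content2>
    <Content3>242136</Content3>
    <Content4>90°</Content4>
  </row>
</TEST>"""


def unwrap_rows(xml_text: str, row_name: str = "row") -> ET.Element:
    """Hypothetical helper: move the children of each <row> up into the root."""
    root = ET.fromstring(xml_text)
    for row in list(root.findall(row_name)):
        position = list(root).index(row)
        root.remove(row)
        # Re-insert the row's children where the row element used to be.
        for offset, child in enumerate(list(row)):
            root.insert(position + offset, child)
    return root


flattened = unwrap_rows(SAMPLE)
ET.indent(flattened)  # pretty-print in place (Python 3.9+)
print(ET.tostring(flattened, encoding="unicode"))
```

The printed document has TEST as its root with Content1 to Content4 as direct children, which is the shape the XSD in the question describes.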

CodePudding user response:

Consider converting the entire data frame to XML once and then writing each child element to its own file with lxml (which you already have installed, since it is the default parser of read_xml and to_xml). Note the row_name argument, which names every row element TEST. The loop below uses enumerate to number the files.

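import os
import lxml.etree as lx
...

# Encode to bytes: lxml rejects str input that carries an encoding declaration
data = lx.fromstring(df.to_xml(row_name="TEST", index=False).encode("utf-8"))

# Write each TEST element to its own file
for n, test in enumerate(data.xpath("//TEST"), start=1):
    xmlfile = os.path.join(base_path, f"TEST_{n}.xml")
    lx.ElementTree(test).write(xmlfile)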
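If lxml is not available, a similar per-row result can be sketched with the standard library alone by building one TEST element per record. This bypasses to_xml entirely, so it is a different technique from the one above. The helper record_to_xml and the sample records are assumptions for illustration; with a real data frame the records could come from df.to_dict(orient="records").

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def record_to_xml(record: dict, root_name: str = "TEST") -> ET.ElementTree:
    """Hypothetical helper: build <TEST> with one child element per column."""
    root = ET.Element(root_name)
    for column, value in record.items():
        if value is None:
            # Skip optional elements such as Content5 (minOccurs="0").
            # NaN values coming from pandas would need their own check.
            continue
        ET.SubElement(root, column).text = str(value)
    return ET.ElementTree(root)


# Stand-in for df.to_dict(orient="records"); the values are made up.
records = [
    {"Content1": 123, "Content2": "abc", "Content3": "242136",
     "Content4": "90°", "Content5": None},
    {"Content1": 456, "Content2": "def", "Content3": "242137",
     "Content4": "45°", "Content5": "extra"},
]

out_dir = tempfile.mkdtemp()  # placeholder for base_path
written = []
for n, record in enumerate(records, start=1):
    path = os.path.join(out_dir, f"TEST_{n}.xml")
    record_to_xml(record).write(path, encoding="utf-8", xml_declaration=True)
    written.append(path)
```

Each file then contains a single TEST root with the column elements directly beneath it, in column order, which matters because the XSD uses xs:sequence.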