Using Pandas DataFrame.to_xml without row or root element

Time:09-27

I have an XSD file that looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<xs:schema version="1.0"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">   
    <xs:element name="TEST">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Content1" type="xs:integer"/> 
                <xs:element name="Content2" type="xs:string" />
                <xs:element name="Content3" type="xs:string"/>
                <xs:element name="Content4" type="xs:string" />
                <xs:element name="Content5" type="xs:string" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

I want to export each row of my data frame into a separate XML file using df.to_xml(). I am using:

for row in range(df.shape[0]):
  df1 = df.iloc[row:row + 1]
  df1.to_xml(f"{base_path}/{filename}", root_name="TEST", index=False)

Currently the output looks like this:

<?xml version='1.0' encoding='utf-8'?>
<TEST>
  <row>
    <Content1>123</Content1>
    <Content2>abc</Content2>
    <Content3>242136</Content3>
    <Content4>90°</Content4>
  </row>
</TEST>

My problem is the <row> and </row> lines. How can I prevent them from being created? Alternatively, I could name the row element TEST and suppress the root element, if that is possible.

DataFrame.to_xml always creates both a root and a row element, but I need only one of them. How can I make the output contain only one?

CodePudding user response:

Another option would be using an XSLT stylesheet:

xslt = '''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" method="xml" />
    
    <xsl:template match="row">
        <xsl:apply-templates select="*"/>
    </xsl:template>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
'''

for row in range(df.shape[0]):
  df1 = df.iloc[row:row + 1]
  df1.to_xml(f"{base_path}/{filename}", root_name="TEST", index=False, stylesheet=xslt)

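The stylesheet is applied to the generated XML before it is written. It copies every element except "row", whose child elements are placed directly under the root element instead.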

CodePudding user response:

Consider converting the entire data frame to XML and then iterating over the row elements with lxml (which you should already have installed, as it is the default parser of read_xml and to_xml), writing each one to its own file. Notice the use of the row_name argument. The loop below uses enumerate for file naming.

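import os
import lxml.etree as lx
...

# to_xml returns a str that includes an encoding declaration; lxml only accepts
# such input as bytes, so encode it before parsing
data = lx.fromstring(df.to_xml(row_name="TEST", index=False).encode("utf-8"))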

for n, test in enumerate(data.xpath("//TEST"), start=1):
    xmlfile = os.path.join(base_path, f"TEST_{n}.xml")
    lx.ElementTree(test).write(xmlfile)
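
If lxml is not available, the same split can be sketched with the standard library's xml.etree.ElementTree. This is only a minimal sketch and not part of the answer above: the sample string is an assumed stand-in for what df.to_xml(row_name="TEST", index=False, xml_declaration=False) would return, and the helper name split_rows is hypothetical. It returns each TEST element as its own serialized document instead of writing files.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the to_xml output: default root element "data",
# one <TEST> child per data frame row.
SAMPLE_XML = """<data>
  <TEST>
    <Content1>123</Content1>
    <Content2>abc</Content2>
  </TEST>
  <TEST>
    <Content1>456</Content1>
    <Content2>def</Content2>
  </TEST>
</data>"""


def split_rows(xml_text, row_name="TEST"):
    """Return one serialized XML document (bytes) per direct child named row_name."""
    root = ET.fromstring(xml_text)
    docs = []
    for child in root.findall(row_name):
        # Drop the whitespace that followed the element inside its parent.
        child.tail = None
        # Serialize the child on its own, so it becomes the root of a new document.
        docs.append(ET.tostring(child, encoding="utf-8", xml_declaration=True))
    return docs


documents = split_rows(SAMPLE_XML)
```

Each item in documents can then be written with open(path, "wb"), using whatever file naming scheme fits (for example the enumerate pattern from the loop above).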