How to normalize the sequence of XML elements and attributes?

Time:11-15

I'm version controlling a bunch of XML files that are generated by third-party applications. Unfortunately, the files are often saved in a way that makes version control more cumbersome than it should be. The applications might swap elements around:

 <root>
-    <b>bar</b>
     <a>foo</a>
+    <b>bar</b>
 </root>

or reorder attributes:

-<root a="foo" b="bar"/>
+<root b="bar" a="foo"/>

or change/remove indentation:

-<root a="foo" b="bar"/>
+<root
+  a="foo"
+  b="bar"/>

To be clear, these files do not mix text and element nodes (like <a>foo <b>bar</b></a>), and there's no semantic difference between the differently ordered files, so it's safe to reorder them any way we want.

I've partially solved this by using xsltproc and the following stylesheet to sort elements:

<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
    <output method="xml" indent="yes" encoding="UTF-8"/>
    <strip-space elements="*"/>

    <template match="processing-instruction()|@*">
        <copy>
            <apply-templates select="node()|@*"/>
        </copy>
    </template>

    <template match="*">
        <copy>
            <apply-templates select="@*"/>
            <apply-templates>
                <sort select="name()"/>
                <sort select="@*[1]"/>
                <sort select="@*[2]"/>
                <sort select="@*[3]"/>
                <sort select="@*[4]"/>
                <sort select="@*[5]"/>
                <sort select="@*[6]"/>
            </apply-templates>
        </copy>
    </template>
</stylesheet>

However, I've recently learned that attribute order is not defined in XML, so sorting by the six "first" attributes won't work in general. And of course this stylesheet doesn't sort the attributes themselves.

(I've used "normalize" in the title because I don't necessarily want to sort the elements in some particular way, it just seemed like the most obvious way to make sure the textual difference between two semantically identical files is empty.)

Is there some way to achieve such ordering?

Despite the similar title, this is different from "XSLT sort by tag name and attribute value". That question involves only a single attribute, and its accepted solution isn't sufficiently general.

Answer:

The purpose of this exercise is not entirely clear. If you just want to "normalize" (canonicalize?) different documents so that the elements and their attributes appear in the same order (and with the same indentation), you could simply do:

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*">
            <xsl:sort select="name()"/>
        </xsl:apply-templates>
        <xsl:apply-templates select="node()">
            <xsl:sort select="name()"/>
        </xsl:apply-templates>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>

When this is applied to the following inputs:

XML 1

<input shape="circle" size="large" color="blue">
    <shape>circle</shape>
    <size>large</size>
    <color>blue</color>
</input>

XML 2

<input color="red" size="small" shape="square">
    <color>red</color>
    <size>small</size>
    <shape>square</shape>
</input>

the results will be respectively:

Result 1

<?xml version="1.0" encoding="UTF-8"?>
<input color="blue" shape="circle" size="large">
  <color>blue</color>
  <shape>circle</shape>
  <size>large</size>
</input>

Result 2

<?xml version="1.0" encoding="UTF-8"?>
<input color="red" shape="square" size="small">
  <color>red</color>
  <shape>square</shape>
  <size>small</size>
</input>
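
The stylesheet above sorts siblings by name only, so same-named siblings that differ only in their attributes keep their input order. If those also need a fixed order, one option is to do the normalization outside XSLT. The following is a minimal sketch in Python's standard library, not part of the original answer. The function names `sort_key`, `normalize` and `normalize_string` are made up for illustration, and the code relies on the question's statement that the files contain no mixed content.

```python
import xml.etree.ElementTree as ET


def sort_key(elem):
    # Assumes normalize() has already run on elem, so its attributes are in
    # name order and the serialized subtree works as a final tie-breaker.
    return (elem.tag, list(elem.attrib.items()), ET.tostring(elem, encoding="unicode"))


def normalize(elem):
    """Sort attributes by name and child elements by sort_key, recursively."""
    attrs = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(attrs)
    if len(elem):
        # Assumption from the question: no mixed content, so text next to
        # child elements is only indentation and can be dropped.
        elem.text = None
    for child in elem:
        child.tail = None
        normalize(child)
    elem[:] = sorted(elem, key=sort_key)
    return elem


def normalize_string(xml_text):
    root = normalize(ET.fromstring(xml_text))
    ET.indent(root)  # requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")
```

Passing two differently ordered versions of a file through `normalize_string` should produce identical text for the kinds of reordering shown in the question. Be aware that the default ElementTree parser drops comments and processing instructions, and that namespace prefixes may be rewritten on output.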

Note:
Since the order of attributes is by definition insignificant, an XSLT processor is not obligated to honor the instruction to sort them. However, in practice most processors will.
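
Where the attribute order has to be guaranteed rather than left to the processor, a canonical XML serializer is one alternative, since canonicalization fixes the attribute order by specification. A minimal sketch, assuming Python 3.8 or later, where `xml.etree.ElementTree.canonicalize` is available. The wrapper name `canonical` is made up for illustration. Canonicalization does not reorder elements and does not indent the output.

```python
import xml.etree.ElementTree as ET


def canonical(xml_text):
    # C14N 2.0 serialization: attribute order is fixed by the specification
    # rather than left to the processor. strip_text removes whitespace around
    # text content, which also drops indentation-only text between elements.
    return ET.canonicalize(xml_text, strip_text=True)
```

With this, the two spellings of `<root a="foo" b="bar"/>` from the question serialize to the same string regardless of attribute order or line breaks inside the tag.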
