Python XML/Pandas: How to merge nested XML?-CodePudding

How can I join two different pieces of information together from this XML file?

# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RSkus>
    <RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
      <Devices>
        <Device ID="1" SkuID="Switch" Role="xD" />
      </Devices>
      <Blades>
        <Blade ID="{1-20}" SkuID="SBlade" />
      </Blades>
      <Interfaces>
        <Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
        <Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
      </Interfaces>
      <Wires>
        <WireGroup Type="network">
          <Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
        </WireGroup>
        <WireGroup Type="serial">
          <Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
        </WireGroup>
      </Wires>
    </RSku>
  </RSkus>
</TopologyDefinition>
''')

While this is a single case and trivial in the instance below; if I run the below commands on the full file, I get shapes that do not match and therefore cannot be joined so easily.

How can I extract the XML information such that for every row, I get all the RSku information PLUS its Blade information. Each xpath contains no information that would let me join it to another xpath so that I may combine the information.

# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")

# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)

CodePudding user response：

Consider transforming the XML with XSLT by flattening the document with information you need. Specifically, retrieve only Blade attributes using descendant::* axis and corresponding RSku attributes using the ancestor::* axis. Python' lxml (default parser of pandas.read_xml) can run XSLT 1.0 scripts.

Below XSLT's <xsl:for-each> is used to prefix RSku_ and Blade_ to attribute names since they share same attribute such as ID. Otherwise template would be much less wordy.

import pandas as pd

xml1 = ...

xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
                version="1.0">  
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/TopologyDefinition">
    <root>
        <xsl:apply-templates select="descendant::Blade"/>
    </root>
  </xsl:template>
  
  <xsl:template match="Blade">
    <data>
      <xsl:for-each select="ancestor::RSku/@*">
        <xsl:attribute name="{concat('RSku_', name())}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:for-each>
      <xsl:for-each select="@*">
        <xsl:attribute name="{concat('Blade_', name())}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:for-each>
    </data>
  </xsl:template>
  
</xsl:stylesheet>''')

blades_df = pd.read xml(xml1, stylesheet=xsl)

Online XSLT Demo