Home > Software engineering >  SQL Server 2012 XML Flatten and reduce fan out / duplicates
SQL Server 2012 XML Flatten and reduce fan out / duplicates

Time:12-11

I'm having some issues with trying to get the some XML data (stored as text in an old MS SQL Server 2012) parsed and into a usable format.

XML data is a string, but when I convert it to XML, it look like this:

<?xml version="1.0" encoding="utf-8"?>
  <header1>
    <header2>
      <OrderFormHeader>
        <AccountNum>123456</AccountNum>
        <OrderNum>000123987</OrderNum>
        <OrderDetails>
          <CompanyName>Biznez1</CompanyName>
          <CompAddressInfo>
            <City>Phoenix</City>
            <State>AZ</State>
          </CompAddressInfo>
          <ShipTo>TRUE</ShipTo>
          <BillTo>FALSE</BillTo>
        </OrderDetails>
      </OrderFormHeader>
      <OrderFormDetails>
        <OrderFormLines>
          <ItemNum>000001</ItemNum>
          <InventoryNum>INV-001-000001</InventoryNum>
          <OtherDetails>
            <QtyOrdered>1</QtyOrdered>
            <ItemDesc>Bandaids</ItemDesc>
            <UnitofMeasure>Box</UnitofMeasure>
            <ItemCode>
              <CodeType>UPC</CodeType>
              <CodeID>123456789123</CodeID>
            </ItemCode>
          <OtherDetails>
        </OrderFormLines>
      </OrderFormDetails>
        <OrderFormLines>
          <ItemNum>000002</ItemNum>
          <InventoryNum>INV-001-000002</InventoryNum>
          <OtherDetails>
            <QtyOrdered>1</QtyOrdered>
            <ItemDesc>QTips</ItemDesc>
            <UnitofMeasure>Box</UnitofMeasure>
            <ItemCode>
              <CodeType>UPC</CodeType>
              <CodeID>123456789987</CodeID>
            </ItemCode>
          <OtherDetails>
        </OrderFormLines>
        <OrderFormLines>
          <ItemNum>000003</ItemNum>
          <InventoryNum>INV-003-000001</InventoryNum>
          <OtherDetails>
            <QtyOrdered>1</QtyOrdered>
            <ItemDesc>Scissors</ItemDesc>
            <UnitofMeasure>Each</UnitofMeasure>
            <ItemCode>
              <CodeType>UPC</CodeType>
              <CodeID>123456987321</CodeID>
            </ItemCode>
          <OtherDetails>
        </OrderFormLines>
      </header2>
    </header1>

Needless to say, it's a crazy XML (at least to me). (Note: There are multiple sets of OrderFormDetails nested within the object and parsing them via my code seems to fan out on the ItemNum and InventoryNum. I've removed the UPC code stuff as that was causing additional fan out, but wouldn't mind bringing that back into my code)

With that said, my current SQL code uses a table variable to take the data from the table, correct the UTF-8 and put it into an XML format. From there, I use the CROSS APPLY functions to get the data out, but it has severe fan-out issues where it will show the data multiple times rather than just 1 row each:

DECLARE @xml TABLE (IMPORTED_XML xml)
INSERT INTO @xml

SELECT
  CAST(REPLACE(mxt.XML_FIELD,'encoding="UTF-8"','encoding="UTF-16"') AS XML) AS IMPORTED_XML
FROM MyXMLTable as mxt

with temp1 AS (
  SELECT DISTINCT
      sales_order.value('(./AccountNum/text())[1]','nvarchar(max)')                                                                       AS ACCOUNT_NUM
    , sales_order.value('(./OrderNum/text())[1]','nvarchar(max)')                                                              AS ORDER_NUM
    , extra_so.value('(./CompanyName/text())[1]','nvarchar(max)')                                                      AS COMPANY_NAME
    , base.value('(./ItemNum/text())[1]','nvarchar(max)')                                                                                      AS ITEM_ID
    , base.value('(./InventoryNum/text())[1]','nvarchar(max)')                                                                                 AS INVENTORY_NUM
    , sales.value('(./QtyOrdered/text())[1]','nvarchar(max)')                                                                                AS QTY_ORDERED
    , sales.value('(./UnitofMeasure/text())[1]','nvarchar(max)')                                                                                     AS ITEM_UOM
    , sales.value('(./ItemDesc/text())[1]','nvarchar(max)')                                                                            AS ITEM_DESC

FROM @xml
  CROSS APPLY IMPORTED_XML.nodes('/header1/header2') AS core(core)
  CROSS APPLY core.nodes('//OrderFormDetails/OrderFormLines') as base(base)
  CROSS APPLY core.nodes('//OrderFormHeader') AS sales_order(sales_order)
  CROSS APPLY base.nodes('//OtherDetails') as sales(sales)
  CROSS APPLY sales_order.nodes('//OrderDetails') AS extra_so(extra_so)
  CROSS APPLY sales.nodes('//ItemCode') as itmcode(itmcode)
)

select * from temp1 order by item_desc asc

This seems to mostly work, but it ends up with multiple rows of data for the same stuff... I'm used to using the lateral flatten function in Snowflake, but not this XML parsing in SQL Server 2012. Any insight into this? Thank you in advance for your help

CodePudding user response:

Your issue is that you are cross-joining each nested node all the way back from the root, because you are using //.

There are other points to note:

  • You don't need temporary tables, you can CROSS APPLY everything together in one query
  • You don't need REPLACE if the column is already varchar, only if it's nvarchar.
  • You don't need to use .nodes on every level of nesting, you only need it if you want multiple items from a single level.
  • Pick your data types carefully, does everything have to be nvarchar(max)?
SELECT 
    sales_order.value('(AccountNum/text())[1]','varchar(50)') AS ACCOUNT_NUM
  , sales_order.value('(OrderNum/text())[1]','varchar(50)') AS ORDER_NUM
  , sales_order.value('(OrderDetails/CompanyName/text())[1]','nvarchar(200)') AS COMPANY_NAME
  , base.value('(ItemNum/text())[1]','varchar(50)')  AS ITEM_ID
  , base.value('(InventoryNum/text())[1]','varchar(50)') AS INVENTORY_NUM
  , sales.value('(QtyOrdered/text())[1]','int')  AS QTY_ORDERED
  , sales.value('(UnitofMeasure/text())[1]','varchar(20)') AS ITEM_UOM
  , sales.value('(ItemDesc/text())[1]','nvarchar(max)') AS ITEM_DESC
  , itmcode.value('(CodeType/text())[1]','varchar(20)') AS itemcodetype
  , itmcode.value('(CodeID/text())[1]','varchar(50)') AS itemcodeID
      

FROM MyXMLTable as mxt
CROSS APPLY (VALUES( CAST(REPLACE(mxt.XML_FIELD,'encoding="UTF-8"','encoding="UTF-16"') AS xml) )) v(IMPORTED_XML)
CROSS APPLY IMPORTED_XML.nodes('/header1/header2') AS core(core)
CROSS APPLY core.nodes('OrderFormHeader') AS sales_order(sales_order)
CROSS APPLY core.nodes('OrderFormDetails/OrderFormLines') as base(base)
CROSS APPLY base.nodes('OtherDetails') as sales(sales)
CROSS APPLY sales.nodes('ItemCode') as itmcode(itmcode);

db<>fiddle

  • Related