I'm having some issues with trying to get the some XML data (stored as text in an old MS SQL Server 2012) parsed and into a usable format.
XML data is a string, but when I convert it to XML, it look like this:
<?xml version="1.0" encoding="utf-8"?>
<header1>
<header2>
<OrderFormHeader>
<AccountNum>123456</AccountNum>
<OrderNum>000123987</OrderNum>
<OrderDetails>
<CompanyName>Biznez1</CompanyName>
<CompAddressInfo>
<City>Phoenix</City>
<State>AZ</State>
</CompAddressInfo>
<ShipTo>TRUE</ShipTo>
<BillTo>FALSE</BillTo>
</OrderDetails>
</OrderFormHeader>
<OrderFormDetails>
<OrderFormLines>
<ItemNum>000001</ItemNum>
<InventoryNum>INV-001-000001</InventoryNum>
<OtherDetails>
<QtyOrdered>1</QtyOrdered>
<ItemDesc>Bandaids</ItemDesc>
<UnitofMeasure>Box</UnitofMeasure>
<ItemCode>
<CodeType>UPC</CodeType>
<CodeID>123456789123</CodeID>
</ItemCode>
<OtherDetails>
</OrderFormLines>
</OrderFormDetails>
<OrderFormLines>
<ItemNum>000002</ItemNum>
<InventoryNum>INV-001-000002</InventoryNum>
<OtherDetails>
<QtyOrdered>1</QtyOrdered>
<ItemDesc>QTips</ItemDesc>
<UnitofMeasure>Box</UnitofMeasure>
<ItemCode>
<CodeType>UPC</CodeType>
<CodeID>123456789987</CodeID>
</ItemCode>
<OtherDetails>
</OrderFormLines>
<OrderFormLines>
<ItemNum>000003</ItemNum>
<InventoryNum>INV-003-000001</InventoryNum>
<OtherDetails>
<QtyOrdered>1</QtyOrdered>
<ItemDesc>Scissors</ItemDesc>
<UnitofMeasure>Each</UnitofMeasure>
<ItemCode>
<CodeType>UPC</CodeType>
<CodeID>123456987321</CodeID>
</ItemCode>
<OtherDetails>
</OrderFormLines>
</header2>
</header1>
Needless to say, it's a crazy XML (at least to me). (Note: There are multiple sets of OrderFormDetails nested within the object and parsing them via my code seems to fan out on the ItemNum and InventoryNum. I've removed the UPC code stuff as that was causing additional fan out, but wouldn't mind bringing that back into my code)
With that said, my current SQL code uses a table variable to take the data from the table, correct the UTF-8 and put it into an XML format. From there, I use the CROSS APPLY functions to get the data out, but it has severe fan-out issues where it will show the data multiple times rather than just 1 row each:
DECLARE @xml TABLE (IMPORTED_XML xml)
INSERT INTO @xml
SELECT
CAST(REPLACE(mxt.XML_FIELD,'encoding="UTF-8"','encoding="UTF-16"') AS XML) AS IMPORTED_XML
FROM MyXMLTable as mxt
with temp1 AS (
SELECT DISTINCT
sales_order.value('(./AccountNum/text())[1]','nvarchar(max)') AS ACCOUNT_NUM
, sales_order.value('(./OrderNum/text())[1]','nvarchar(max)') AS ORDER_NUM
, extra_so.value('(./CompanyName/text())[1]','nvarchar(max)') AS COMPANY_NAME
, base.value('(./ItemNum/text())[1]','nvarchar(max)') AS ITEM_ID
, base.value('(./InventoryNum/text())[1]','nvarchar(max)') AS INVENTORY_NUM
, sales.value('(./QtyOrdered/text())[1]','nvarchar(max)') AS QTY_ORDERED
, sales.value('(./UnitofMeasure/text())[1]','nvarchar(max)') AS ITEM_UOM
, sales.value('(./ItemDesc/text())[1]','nvarchar(max)') AS ITEM_DESC
FROM @xml
CROSS APPLY IMPORTED_XML.nodes('/header1/header2') AS core(core)
CROSS APPLY core.nodes('//OrderFormDetails/OrderFormLines') as base(base)
CROSS APPLY core.nodes('//OrderFormHeader') AS sales_order(sales_order)
CROSS APPLY base.nodes('//OtherDetails') as sales(sales)
CROSS APPLY sales_order.nodes('//OrderDetails') AS extra_so(extra_so)
CROSS APPLY sales.nodes('//ItemCode') as itmcode(itmcode)
)
select * from temp1 order by item_desc asc
This seems to mostly work, but it ends up with multiple rows of data for the same stuff... I'm used to using the lateral flatten function in Snowflake, but not this XML parsing in SQL Server 2012. Any insight into this? Thank you in advance for your help
CodePudding user response:
Your issue is that you are cross-joining each nested node all the way back from the root, because you are using //
.
There are other points to note:
- You don't need temporary tables, you can
CROSS APPLY
everything together in one query - You don't need
REPLACE
if the column is alreadyvarchar
, only if it'snvarchar
. - You don't need to use
.nodes
on every level of nesting, you only need it if you want multiple items from a single level. - Pick your data types carefully, does everything have to be
nvarchar(max)
?
SELECT
sales_order.value('(AccountNum/text())[1]','varchar(50)') AS ACCOUNT_NUM
, sales_order.value('(OrderNum/text())[1]','varchar(50)') AS ORDER_NUM
, sales_order.value('(OrderDetails/CompanyName/text())[1]','nvarchar(200)') AS COMPANY_NAME
, base.value('(ItemNum/text())[1]','varchar(50)') AS ITEM_ID
, base.value('(InventoryNum/text())[1]','varchar(50)') AS INVENTORY_NUM
, sales.value('(QtyOrdered/text())[1]','int') AS QTY_ORDERED
, sales.value('(UnitofMeasure/text())[1]','varchar(20)') AS ITEM_UOM
, sales.value('(ItemDesc/text())[1]','nvarchar(max)') AS ITEM_DESC
, itmcode.value('(CodeType/text())[1]','varchar(20)') AS itemcodetype
, itmcode.value('(CodeID/text())[1]','varchar(50)') AS itemcodeID
FROM MyXMLTable as mxt
CROSS APPLY (VALUES( CAST(REPLACE(mxt.XML_FIELD,'encoding="UTF-8"','encoding="UTF-16"') AS xml) )) v(IMPORTED_XML)
CROSS APPLY IMPORTED_XML.nodes('/header1/header2') AS core(core)
CROSS APPLY core.nodes('OrderFormHeader') AS sales_order(sales_order)
CROSS APPLY core.nodes('OrderFormDetails/OrderFormLines') as base(base)
CROSS APPLY base.nodes('OtherDetails') as sales(sales)
CROSS APPLY sales.nodes('ItemCode') as itmcode(itmcode);