Extract XML from CData-CodePudding

I have the following problem. I need to write an application in C# that will read a given XML and prepare data for me to load into the database. In XML, the structure of which I have no influence, the main data is placed in CDATA. I have verified that the structure of this data is ordered in the correct XML structure.

I've searched hundreds of posts and can't find any solution from them. Below is the XML file from which I need to extract the data from the CDATA section. Maybe one of you can help me?

<Docs>
  <Doc>
  <Content>
    <![CDATA[
      <Doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Header DocNumber="1" Description="Desc1"></Header>
        <Poss>
          <Pos Id="1" Name="Pos1"></Pos>
          <Pos Id="2" Name="Pos2"></Pos>
        </Poss>
      </Doc>
    ]]>
  </Content>
  </Doc>
  <Doc>
  <Content>
    <![CDATA[
      <Doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Header DocNumber="2" Description="Desc2"></Header>
        <Poss>
          <Pos Id="3" Name="Pos3"></Pos>
          <Pos Id="4" Name="Pos4"></Pos>
        </Poss> 
      </Doc>
    ]]>
  </Content>
  </Doc>
</Docs>

For me, the most important are the fields contained in the Content section and I have to load them as data into the database.

CodePudding user response：

Here is implementation based on a stored procedure with the XML parameter like in my comments.

I had to remove the <Poss> XML element to make CData section a well-formed XML.

SQL

DECLARE @xml XML = 
N'<Docs>
    <Doc>
        <Content><![CDATA[
      <Doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Header DocNumber="1" Description="Desc1"></Header>
          <Pos Id="1" Name="Pos1"></Pos>
          <Pos Id="2" Name="Pos2"></Pos>
      </Doc>
    ]]>
        </Content>
    </Doc>
    <Doc>
        <Content><![CDATA[
      <Doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Header DocNumber="2" Description="Desc2"></Header>
          <Pos Id="3" Name="Pos3"></Pos>
          <Pos Id="4" Name="Pos4"></Pos>
      </Doc>
    ]]>
        </Content>
    </Doc>
</Docs>';

--INSERT INTO <targetTable>
SELECT h.value('(Header/@DocNumber)[1]', 'INT') AS DocNumber
    , h.value('(Header/@Description)[1]', 'VARCHAR(256)') AS DocDescription
    , d.value('@Id', 'INT') AS posID
    , d.value('@Name', 'VARCHAR(256)') AS posName
FROM @xml.nodes('/Docs/Doc/Content/text()') AS t(c)
CROSS APPLY (SELECT TRY_CAST(c.query('.').value('.', 'NVARCHAR(MAX)') AS XML)) AS t1(x)
CROSS APPLY x.nodes('/Doc') AS t2(h)
CROSS APPLY h.nodes('Pos') AS t3(d);

Output

DocNumber	DocDescription	posID	posName
2	Desc2	3	Pos3
2	Desc2	4	Pos4
1	Desc1	1	Pos1
1	Desc1	2	Pos2

CodePudding user response：

To extract the data from the CData part,

Construct the classes.

public class Doc
{
    public Header Header { get; set; }
    
    [XmlArrayItem(typeof(Pos), ElementName = "Pos")]
    public List<Pos> Poss { get; set; }
}

public class Header
{
    [XmlAttribute]
    public int DocNumber { get; set; }
    
    [XmlAttribute]
    public string Description { get; set; }
}

public class Pos
{
    [XmlAttribute]
    public int Id { get; set; }
    
    [XmlAttribute]
    public string Name { get; set; }
}

Implement the extraction logic.

2.1. Read the XML string as XDocument via XDocument.Parse().

2.2. Select the DescendantNodes for the XPath of "/Docs/Doc/Content".

2.3. Convert the nodes to XCData type.

2.4. With XmlSerializer to deserialize the value of XCData to Doc type.

using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Serialization;
using System.IO;

XmlSerializer xmlSerializer = new XmlSerializer(typeof(Doc)); 
XDocument xDoc = XDocument.Parse(xml);
        
var cdataSections = xDoc.XPathSelectElements("/Docs/Doc/Content")
    .DescendantNodes()
    .OfType<XCData>()
    .Select(x => (Doc)xmlSerializer.Deserialize(new StringReader(x.Value)))
    .ToList();

Demo @ .NET Fiddle