How can I stop XDocument from resolving my character entities

Time:10-04

I am reading XML and manipulating the data in various ways. However, many of the XML documents contain ISO character entities. I need to retain these as their entity codes, but when XDocument reads the XML file, it immediately resolves the entities into their respective symbols.

How can I prevent this?

Here is a very small sample of XML with 5 entities listed in a table. I need to read the file but keep the entity codes:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
<table>
    <title>iso-amsa.ent</title>
    <tgroup cols="3">
        <colspec colname="col1" colwidth="0.50*"/>
        <colspec colname="col2" align="center" colwidth="0.40*"/>
        <colspec colname="col3" colwidth="2.20*"/>
        <thead>
            <row><entry><para>ISO Entity Name</para></entry><entry><para>Unicode Entity</para></entry><entry><para>Description</para></entry></row>
        </thead>
        <tbody>
            <row><entry><para>cularr</para></entry><entry><para>&#x21B6;</para></entry><entry><para>ANTICLOCKWISE TOP SEMICIRCLE ARROW</para></entry></row>
            <row><entry><para>curarr</para></entry><entry><para>&#x21B7;</para></entry><entry><para>CLOCKWISE TOP SEMICIRCLE ARROW</para></entry></row>
            <row><entry><para>dArr</para></entry><entry><para>&#x21D3;</para></entry><entry><para>DOWNWARDS DOUBLE ARROW</para></entry></row>
            <row><entry><para>darr2</para></entry><entry><para>&#x21CA;</para></entry><entry><para>DOWNWARDS PAIRED ARROWS</para></entry></row>
            <row><entry><para>dharl</para></entry><entry><para>&#x21C3;</para></entry><entry><para>DOWNWARDS HARPOON WITH BARB LEFTWARDS</para></entry></row>
        </tbody>
    </tgroup>
</table>
</doc>

This is how I read the file (I have also tried several other approaches):

// Verbatim string (@) so the backslash is not treated as an escape sequence.
string fileName = @"C:\MyTestFile.xml";

XDocument _doc = XDocument.Load(fileName);

As soon as the XML is read, it converts the entities to their symbols.

How can I prevent this?

CodePudding user response:

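XDocument does not keep track of how a character was written in the source file. Character references such as &#x21B6; are resolved by the parser while the XML is loaded, so the in-memory document holds only the resulting characters. The encoding is a property of the underlying stream, not of the loaded XML.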
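
To see this at parse time, here is a minimal sketch. The class name ReferenceDemo and the helper TextOf are made up for illustration and are not part of the question's code; the fragment is a single <para> element taken from the sample table.

```csharp
using System;
using System.Xml.Linq;

static class ReferenceDemo
{
    // Hypothetical helper: parses a one-element fragment and returns its text.
    public static string TextOf(string xml) => XElement.Parse(xml).Value;

    static void Main()
    {
        // The source spells the arrow as a numeric character reference.
        string text = TextOf("<para>&#x21B6;</para>");

        // After parsing, only the resolved character U+21B6 is left.
        Console.WriteLine(text.Length);       // 1
        Console.WriteLine(text == "\u21B6");  // True
    }
}
```

Nothing in the loaded tree records that the character was originally written as a reference, which is why the fix has to happen when the document is written back out.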

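If you want those characters written as entities again when you save the XML, use an XmlWriter with an Encoding that cannot represent them, such as ASCII. The writer then emits every character outside that encoding as a numeric character reference.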

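For example, using a MemoryStream:

// Requires: using System; using System.IO; using System.Text; using System.Xml;
var ms = new MemoryStream();
using (var writer = XmlWriter.Create(ms, new XmlWriterSettings {Encoding = Encoding.ASCII}))
{
    _doc.Save(writer);
}
// ToArray() returns only the bytes actually written (ms.Length is a long and cannot be passed to GetString directly).
Console.WriteLine(Encoding.ASCII.GetString(ms.ToArray()));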

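Or using a FileStream:

// FileMode.Create truncates an existing file; OpenOrCreate would leave stale bytes behind if the old file was longer.
using (var fs = new FileStream(@"somePathHere", FileMode.Create, FileAccess.Write))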
using (var writer = XmlWriter.Create(fs, new XmlWriterSettings {Encoding = Encoding.ASCII}))
{
    _doc.Save(writer);
}
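
For completeness, here is a self-contained sketch of the whole round trip. The class name AsciiXml and the helper Serialize are hypothetical names chosen for this example; the approach is the same ASCII XmlWriter shown above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class AsciiXml
{
    // Hypothetical helper: serializes the document as ASCII so that every
    // non-ASCII character is written as a numeric character reference.
    public static string Serialize(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var ms = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            } // disposing the writer flushes its output into the stream

            return Encoding.ASCII.GetString(ms.ToArray());
        }
    }

    static void Main()
    {
        // Loading resolves the reference to the character U+21B6.
        XDocument doc = XDocument.Parse("<para>&#x21B6;</para>");

        // Saving through the ASCII writer turns it back into a reference.
        string xml = Serialize(doc);
        Console.WriteLine(xml.Contains("&#x21B6;")); // True
    }
}
```

Note that this does not restore the original spelling: every non-ASCII character comes back as a hexadecimal reference, including characters that were literal in the source file, and named entities declared in a DTD would not come back by name.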