I am reading XML and manipulating the data in various ways. However, many of the XML documents contain ISO character entities. I need to retain these as their entity codes, but when XDocument reads the XML file, it immediately resolves the entities into their respective symbols.
How can I prevent this?
Here is a very small sample of XML with 5 entities listed in a table. I need to read the file but keep the entity codes:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<table>
<title>iso-amsa.ent</title>
<tgroup cols="3">
<colspec colname="col1" colwidth="0.50*"/>
<colspec colname="col2" align="center" colwidth="0.40*"/>
<colspec colname="col3" colwidth="2.20*"/>
<thead>
<row><entry><para>ISO Entity Name</para></entry><entry><para>Unicode Entity</para></entry><entry><para>Description</para></entry></row>
</thead>
<tbody>
<row><entry><para>cularr</para></entry><entry><para>&#x21B6;</para></entry><entry><para>ANTICLOCKWISE TOP SEMICIRCLE ARROW</para></entry></row>
<row><entry><para>curarr</para></entry><entry><para>&#x21B7;</para></entry><entry><para>CLOCKWISE TOP SEMICIRCLE ARROW</para></entry></row>
<row><entry><para>dArr</para></entry><entry><para>&#x21D3;</para></entry><entry><para>DOWNWARDS DOUBLE ARROW</para></entry></row>
<row><entry><para>darr2</para></entry><entry><para>&#x21CA;</para></entry><entry><para>DOWNWARDS PAIRED ARROWS</para></entry></row>
<row><entry><para>dharl</para></entry><entry><para>&#x21C3;</para></entry><entry><para>DOWNWARDS HARPOON WITH BARB LEFTWARDS</para></entry></row>
</tbody>
</tgroup>
</table>
</doc>
This is how I read the file (I have tried several variations):
string fileName = @"C:\MyTestFile.xml";
XDocument _doc = XDocument.Load(fileName);
As soon as the XML is read, it converts the entities to their symbols.
How can I prevent this?
Answer:
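XDocument does not preserve character references or the text encoding once the XML is loaded. The parser resolves each reference to its character, and the encoding is part of the underlying stream, not the loaded XML.
If you want those characters written back out as character references when you save the XML, use an XmlWriter with an Encoding that cannot represent them, such as ASCII. The writer then escapes every character outside that encoding as a numeric character reference.
For example, using a MemoryStream: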
var ms = new MemoryStream();
using (var writer = XmlWriter.Create(ms, new XmlWriterSettings {Encoding = Encoding.ASCII}))
{
_doc.Save(writer);
}
Console.WriteLine(Encoding.ASCII.GetString(ms.ToArray()));
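Or using a FileStream:
// FileMode.Create truncates an existing file; OpenOrCreate would leave stale bytes after a shorter write
using (var fs = new FileStream(@"somePathHere", FileMode.Create, FileAccess.Write))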
using (var writer = XmlWriter.Create(fs, new XmlWriterSettings {Encoding = Encoding.ASCII}))
{
_doc.Save(writer);
}
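To check the behaviour end to end, here is a minimal, self-contained sketch of the round trip. The one-element input string and the Program class are made up for illustration and are not part of the question's file. It assumes the goal is to get numeric character references back on output, since XDocument itself cannot hold them unresolved.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Hypothetical sample input: a single numeric character reference.
        XDocument doc = XDocument.Parse("<para>&#x21B6;</para>");

        // After parsing, the reference has already been resolved to the character U+21B6.
        Console.WriteLine(doc.Root.Value == "\u21B6");

        // ASCII cannot represent U+21B6, so the writer has to escape it on output.
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };

        using (var ms = new MemoryStream())
        {
            // Dispose the writer first so everything is flushed into the stream.
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            }

            // ToArray returns only the bytes actually written.
            string xml = Encoding.ASCII.GetString(ms.ToArray());
            Console.WriteLine(xml);
        }
    }
}
```

Two caveats with this approach: every non-ASCII character in the document is escaped on output, including any that were literal characters in the input, and the output uses numeric references only. Named entities such as &cularr; are not restored this way.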