I am attempting to write an XmlDocument to a file using an encoding other than UTF-8 (Encoding.ASCII in this case) and have XmlWriter automatically replace characters not supported by the encoding with equivalent character entities. To do this, I am using the sample code from Conversion of the special characters while adding it to the XML innertext in C#. However, for my XML, rather than replacing characters with character entities, it throws an exception like the one below:
Unable to translate Unicode character \u2018 at index 5852 to specified code page.Encode_Save
What is the cause of this exception? Why aren't the unsupported characters getting escaped as expected?
Code used:
My serialization code:
using (var stream = new FileStream(clsGlobal.outputXMLPath, FileMode.OpenOrCreate))
{
clsGlobal.XMLDoc.Save(stream, indent: false, encoding: Encoding.ASCII, omitXmlDeclaration: false);
}
The code for Save() from the linked question:
public static class XmlSerializationHelper
{
public static string GetOuterXml(this XmlNode node, bool indent = false, Encoding encoding = null, bool omitXmlDeclaration = false)
{
if (node == null)
return null;
var stream = new MemoryStream();
node.Save(stream, indent: indent, encoding: encoding, omitXmlDeclaration: omitXmlDeclaration, closeOutput: false);
stream.Position = 0;
var reader = new StreamReader(stream);
return reader.ReadToEnd();
}
public static void Save(this XmlNode node, Stream stream, bool indent = false, Encoding encoding = null, bool omitXmlDeclaration = false, bool closeOutput = true) =>
node.Save(stream, new XmlWriterSettings
{
Indent = indent,
Encoding = encoding,
OmitXmlDeclaration = omitXmlDeclaration,
CloseOutput = closeOutput,
});
public static void Save(this XmlNode node, Stream stream, XmlWriterSettings settings)
{
try
{
using (var xmlWriter = XmlWriter.Create(stream, settings))
{
node.WriteTo(xmlWriter);
}
}
catch (Exception ex)
{
clsGlobal.globalErrCount++;
clsGlobal.WriteLog(ex.Message + "Encode_Save");
}
}
}
Input XML data, stored in clsGlobal.XMLDoc:
<?xml version="1.0" encoding="UTF-8"?>
<article dtd="RSCART3.8">
<art-admin>
<ms-id>BK9781839161964-00123</ms-id>
<doi>10.1039/9781839165580-00123</doi>
</art-admin>
<published type="book">
<journalref>
<title>DNA Photodamage: From Light Absorption to Cellular Responses and Skin Cancer</title>
<sercode>BK</sercode>
<publisher>
<orgname>
<nameelt>Royal Society of Chemistry</nameelt>
</orgname>
</publisher>
<issn type="isbn" />
<cpyrt>© European Society for Photobiology 2022</cpyrt>
</journalref>
<volumeref>
<link />
</volumeref>
<pubfront>
<fpage>0</fpage>
<lpage>0</lpage>
<no-of-pages>0</no-of-pages>
<date>
<year>2022</year>
</date>
</pubfront>
</published>
<art-front>
<titlegrp>
<title>Chapter 2</title>
<title>In Silico Tools to Assess Chemical Hazard</title>
</titlegrp>
<abstract>
<p>
Fundamentally, chemical hazard is a function of structure, and the quickest and cheapest way to predict toxicity is to do so from structure alone. Currently, there are many tools available to predict absorption, distribution, metabolism, and excretion (ADME), as well as some key endpoints, such as LD
<inf>50</inf>
(the minimal dose necessary to kill half the animals exposed), mutagenicity, skin sensitization, and ecotoxicity. While quantitative structure–activity relationships (QSARS) and read-across are well established, the field is rapidly changing with the advent of larger data sets and more sophisticated machine learning approaches. As computational power increases, 3D models may become widely available. However, virtually all models have blind spots, and some endpoints (such as developmental toxicity and endocrine disruption) have proven difficult to predict from structure alone – in these cases, it is necessary to use toxicity tests that capture the complexity of a biological system.
</p>
</abstract>
</art-front>
<art-body>
<section>
<no>0.0</no>
<title>2.1 Introduction</title>
<p>
“It is obvious that there must exist a relation between the chemical constitution and the physiological action of a substance, but as yet scarcely any attempts have been made to discover what this relation is. . . .”
<citref idrefs="cit1">1</citref>
This was written in 1865 by Alexander Crum Brown, a chemist who worked in tandem with a medical student, and represents the very first conjecture of the basic principle that is the foundation of
<it>in silico</it>
toxicology: that, fundamentally, chemical hazard is a function of chemical structure. In theory, then, the quickest and cheapest way to predict toxicity is to do so from structure alone. In practice, as we shall see, this is often challenging – but understanding what we can and cannot predict from structure alone is a good way to understand how chemicals affect biological systems.
</p>
<p>
At its most basic, a chemical can be said to be hazardous when it has the potential to interact with a biological system in a way that causes harm – or to use the regulatory term, “an adverse outcome.” Sometimes the negative effect is because a chemical is a mutagen –
<it>e.g</it>
. an electrophilic chemical might cause alkylation of DNA, which is nucleophilic, resulting in an error in the genetic code and, potentially, cancer. Or, a chemical might have a structure that so closely mimics a biological molecule that it can interact with a receptor for the endogenous molecule – as happens when chemicals that are large and coplanar, such as diethylstilbestrol, bind to the estrogen receptor and therefore prevent normal endocrine signaling. Similar mechanisms are thought to underlie many of the chemicals that are considered potential endocrine disruptors. A chemical can displace something essential –
<it>e.g</it>
. carbon monoxide (CO) binds more strongly to hemoglobin than oxygen, and in sufficient quantities, it will deprive tissues of oxygen, resulting in cellular death and eventually asphyxiation.
</p>
<p>
Sometimes hazard is a straightforward result of the chemical properties of a molecule – most strong acids or bases will cause skin and eye irritation. Other times there are several steps –
<it>e.g</it>
. 2,4-dinitrochlorobenzene can easily be absorbed through the skin barrier, and then bind with many proteins in the dermal layer. These altered proteins (“haptens”) are then recognized by the immune system as “foreign material” – and because your immune system is always on the lookout for foreign proteins, it activates immune cells that respond to the hapten, creating an allergic reaction that will persist. In some cases, the chemical itself is not a problem, but once inside the body, it can be metabolized into something problematic, as in the case of acetaminophen.
</p>
<p>
There are two main components to predicting toxicity. Toxicokinetics refers to how the xenobiotic is absorbed, distributed, metabolized, and excreted. Fundamentally, the balance of these factors determines the biologically effective dose – the amount of a xenobiotic that can cause harm. Toxicodynamics refers to how the chemical reacts in a negative way with biological molecules – proteins, DNA, or the cell membrane. Ultimately, the dose and the manner in which a compound causes harm determines whether there are effects at the cellular level. Severe enough effects at the cellular level eventually cause organ damage – the harmful outcome referred to as an “adverse effect.” (
<figref idrefs="fig1">Figure 2.1</figref>
).
</p>
</section>
<section>
<no>0.0</no>
<title>2.5 Conclusion</title>
<p>
Skin sensitization is the one endpoint that also has multiple
<it>in silico</it>
tools models available, ranging from SAR approaches such as ToxTree to more sophisticated QSARs:
<it>e.g.</it>
PredSkin,
<citref idrefs="cit47">47</citref>
which is based on human data and available
<it>via</it>
the web, and the OECD QSAR Toolbox,
<citref idrefs="cit48">48</citref>
which has an automated workflow for skin sensitization. In general, most of these models perform well (with the OECD QSAR Toolbox having 80% balanced accuracy) although the models differ in their sensitivity and specificity
<!--AQ27-->
. The value of 80% might seem disappointing, but the reality is that the animal test these models are built off of – the LLNA test – is only ≈80% reproducible,
<citref idrefs="cit46">46</citref>
and although figures vary, it only predicts human sensitization with a similar level of accuracy.
<citref idrefs="cit49">49</citref>
As yet, these models typically predict binary sensitization status, instead of potency, which is a significant drawback – many chemicals that are very weak sensitizers are often predicted as sensitizers although their actual hazard under most exposure conditions might be small. However, the
<it>in silico</it>
models are, at this point, performing about as well as can be expected given the limitations of the data. Because skin sensitization represents an instance where the toxicodynamics are well understood – something we will discuss in Chapter 3 – it also offers an instance where
<it>in vitro</it>
data can be used as an effective supplement in
<it>in silico</it>
models. Further improvement will likely require new ways to think about combining
<it>in silico</it>
,
<it>in chemico</it>
, and
<it>in vitro</it>
data.
</p>
<p>
Currently, there are many tools available to predict ADME, as well as some key endpoints, such as LD
<inf>50</inf>
, mutagenicity, skin sensitization, and ecotoxicity. We can predict some important endpoints based on others –
<it>e.g.</it>
it does not take a great leap of imagination to understand that most skin irritants will also be eye irritants, even though the reverse is not always true. Skin sensitization should raise a concern for respiratory sensitization, although not conclusively as there are differences in bioavailability and mechanism that means this is not a universal rule.
<citref idrefs="cit50">50</citref>
A chemical that interferes in DNA replication is likely to cause developmental effects should it go through the fetal–placental barrier, but there are many mechanisms by which a chemical can cause developmental effects, and there are no validated models that are considered robust enough for regulatory acceptance. In theory, read-across and QSARs can be used in a well-defined chemical class if the mechanism is known. In practice, given the well-known difficulty of connecting structure to developmental toxicity, this remains an endpoint that requires an
<it>in vivo</it>
study for clarity.
</p>
<p>
Of course, no model is perfect and there are several caveats that apply to all models broadly. A model is only as good as the data that goes into it, and in many instances the data will have a great deal of noise as well as missing data. Most data sets assembled for predictive models will not cover a diverse area of the chemical space, and are often biased towards positives, for the simple reason that people tend not to gather data on chemicals that are largely biologically inert. However, this can be problematic:
<it>e.g.</it>
if a data set consists of 100 chemicals, and 80% of them are considered skin sensitizers, a model that simply declares every chemical a sensitizer will have 80% accuracy. Therefore, when judging model performance, always look to the sensitivity, specificity, and balanced accuracy. Many models, like structural alerts and read-across, are better at identifying toxic compounds than establishing the absence of toxicity. While this is useful for screening-level approaches that are oriented towards being precautionary, it is problematic when trying to decide between chemical candidates in the R&D phase.
</p>
<p>Passive diffusion is relatively easy to predict, because it depends solely on chemical properties, and because of this we have models that will predict diffusion across skin, intestine, and lung tissue. We can also predict whether a chemical will likely passively diffuse across the blood–brain barrier, but have few models that can identify transporter-mediated absorption. With the exception of the relatively well-studied PGP transporter, this has proven very difficult to model because of the diversity of transporters. The probability of a chemical being metabolized by a Phase I enzyme can also be predicted, even if the prediction of the metabolite is more difficult. Finally, based on physical chemical properties, we can estimate overall distribution, excretion, and half-life.</p>
<p>
In terms of toxicodynamics – predicting biological targets of chemicals and the downstream effects –the search space is more complicated both because of the diversity of targets and the biological variability of the subsequent events. Endpoints with a straightforward connection to chemical structure –
<it>e.g.</it>
mutagenicity and skin sensitization, which are both related to electrophilicity – can be proactively identified with structural alerts, and modeled with QSARs. More complicated endpoints can be predicted with limited success, and most such models should be treated with caution. If you do not truly understand the relationship between chemical structure and toxicity, read-across or QSARs will necessarily be limited – you can never know whether two similar molecules are in fact an activity cliff. Moreover, virtually all models will have some blindspots that will reflect the era in which they were developed as well as the data available, and if not updated will tend to become increasingly outdated.
</p>
<p>
Finally,
<it>in silico</it>
approaches can only be used on discrete, organic structures
<!--AQ28-->
. By a rough estimate, however, that means that 50% of the chemicals within commerce cannot be evaluated with
<it>in silico</it>
tools, as they are mixtures (called UVCBs), metal compounds, or salts, in addition to containing impurities – and even small amounts of impurities can give rise to adverse events (
<it>e.g.</it>
sensitization or mutagenicity
<!--AQ29-->
). Such chemicals are likely to increase as many bio-based chemicals are UVCBs, polymers, and engineered nanomaterials, which cannot be handled easily by existing
<it>in silico</it>
tools.
</p>
<p>Glossary</p>
<figure id="fig1" xsrc="BK9781839161964-00123-f1.tif" pos="float">
<title>
Toxicokinetics and toxicodynamics together determine whether a xenobiotic will cause a disease. Adapted from ref.
<citref idrefs="cit51">51</citref>
, https://doi.org/10.14573/altex.1610101, under the terms of the CC BY 4.0 license,
<url url="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</url>
.
</title>
</figure>
<figure id="fig2" xsrc="BK9781839161964-00123-f2.tif" pos="float">
<title>Phase I metabolism involves either oxidation or hydrolysis, typically resulting in a more reactive intermediate. Phase II conjugates the compounds either with glutathione, in the case of electrophiles, or sulfation, acetylation, or glucuronidation to make a compound more water soluble.</title>
</figure>
<figure id="fig3" xsrc="BK9781839161964-00123-f3.tif" pos="float">
<title>
ADME is determined by absorption (ingestion, inhalation, or dermal), distribution primarily
<it>via</it>
The blood and lymph, and excretion
<!--AQ87-->
.
</title>
</figure>
<figure id="fig4" xsrc="BK9781839161964-00123-f4.tif" pos="float">
<title>
Paracetamol metabolism. Paracetamol can be immediately glucuronidated or sulfated without being metabolized by a Phase I enzyme. However, some will be oxidized
<it>via</it>
CYP2E1 into a reactive intermediate.
</title>
</figure>
<figure id="fig5" xsrc="BK9781839161964-00123-f5.tif" pos="float">
<title>Phorbol ester structure, from PubChem.</title>
</figure>
<figure id="fig6" xsrc="BK9781839161964-00123-f6.tif" pos="float">
<title>The ultimate rat carcinogen. Reproduced from Ref. 52, DOI:10.2788/6234, under the terms of the CC BY 4.0 license https://creativecommons.org/licenses/by/4.0/.</title>
</figure>
<figure id="fig7" xsrc="BK9781839161964-00123-f7.tif" pos="float">
<title>Structural analogs for Bisphenol A as selected by GenRA. One the left is ToxPrints, on the right Morgan fingerprints.</title>
</figure>
<table-entry id="tab4">
<title>Table 2.4 Non-commercial read-across and QSAR</title>
<table frame="topbot">
<tgroup cols="3" align="left" colsep="1" rowsep="1" />
<colspec colnum="1" colname="c1" />
<colspec colnum="2" colname="c2" />
<colspec colnum="3" colname="c3" />
<thead />
<tbody>
<row>
<entry>
<bo>
<it>Software</it>
</bo>
</entry>
<entry>
<bo>
<it>Models available</it>
</bo>
</entry>
<entry>
<bo>
<it>Platform</it>
</bo>
</entry>
</row>
<row>
<entry>OECD QSAR Toolbox</entry>
<entry>Read-across, QSARs, QSPR for multiple endpoints</entry>
<entry>Requires Windows</entry>
</row>
<row>
<entry>GenRA</entry>
<entry>Read-across</entry>
<entry>
Available
<it>via</it>
Web at the EPA Comptox Dashboard
</entry>
</row>
<row>
<entry align="char" char=".">T.E.S.T.</entry>
<entry>Global QSAR for acute toxicity, estrogen receptor binding, developmental toxicity, ecotoxicology endpoints</entry>
<entry>
Available
<it>via</it>
web at the EPA Comptox Dashboard and as stand-alone software
</entry>
</row>
<row>
<entry>ECOSAR</entry>
<entry>Ecotoxicology endpoints</entry>
<entry>Available as stand-alone software</entry>
</row>
<row>
<entry>VEGA</entry>
<entry>ADME, Read-across, and QSAR for multiple endpoints</entry>
<entry>Java application for Mac\Linux\</entry>
</row>
<row>
<entry>Danish QSAR Database</entry>
<entry>Global QSARs based on existing models for multiple endpoints; applicability domain indicated</entry>
<entry>
Available
<it>via</it>
the web
</entry>
</row>
</tbody>
</table>
</table-entry>
</section>
</art-body>
<art-back>
<biblist title="References">
<citgroup id="cit1">
<journalcit>
<citauth>
<fname>A. C.</fname>
<surname>Brown</surname>
</citauth>
<citauth>
<fname>T. R.</fname>
<surname>Fraser</surname>
</citauth>
<arttitle>On the Connection between Chemical Constitution and Physiological Action; with special reference to the Physiological Action of the Salts of the Ammonium Bases derived from Strychnia, Brucia, Thebaia, Codeia, Morphia, and Nicotia</arttitle>
<title>J. Anat. Physiol.</title>
<year>1868</year>
<volumeno>2</volumeno>
<pages>
<fpage>224</fpage>
<lpage>242</lpage>
</pages>
</journalcit>
</citgroup>
<citgroup id="cit2">
<journalcit>
<citauth>
<fname>C.</fname>
<surname>Lynch</surname>
</citauth>
<title>Anesth. Analg.</title>
<year>2008</year>
<volumeno>107</volumeno>
<pages>
<fpage>864</fpage>
<lpage>867</lpage>
</pages>
</journalcit>
</citgroup>
<citgroup id="cit3">
<journalcit>
<citauth>
<fname>C. A.</fname>
<surname>Lipinski</surname>
</citauth>
<arttitle>Lead- and drug-like compounds: the rule-of-five revolution</arttitle>
<title>Drug Discov. Today Technol.</title>
<year>2004</year>
<volumeno>1</volumeno>
<pages>
<fpage>337</fpage>
<lpage>341</lpage>
</pages>
</journalcit>
</citgroup>
<citgroup id="cit4">
<journalcit>
<citauth>
<fname>D.</fname>
<surname>Epel</surname>
</citauth>
<citauth>
<fname>T.</fname>
<surname>Luckenbach</surname>
</citauth>
<citauth>
<fname>C. N.</fname>
<surname>Stevenson</surname>
</citauth>
<citauth>
<fname>L. A.</fname>
<surname>Macmanus-Spencer</surname>
</citauth>
<citauth>
<fname>A.</fname>
<surname>Hamdoun</surname>
</citauth>
<citauth>
<fname>T.</fname>
<surname>Smital</surname>
</citauth>
<arttitle>Efflux transporters: newly appreciated roles in protection against pollutants</arttitle>
<title>Environ. Sci. Technol.</title>
<year>2008</year>
<volumeno>42</volumeno>
<pages>
<fpage>3914</fpage>
<lpage>3920</lpage>
</pages>
</journalcit>
</citgroup>
<citgroup id="cit5">
<journalcit>
<citauth>
<fname>L.-A.</fname>
<surname>Clerbaux</surname>
</citauth>
<citauth>
<fname>A.</fname>
<surname>Paini</surname>
</citauth>
<citauth>
<fname>A.</fname>
<surname>Lumen</surname>
</citauth>
<citauth>
<fname>H.</fname>
<surname>Osman-Ponchet</surname>
</citauth>
<citauth>
<fname>A. P.</fname>
<surname>Worth</surname>
</citauth>
<citauth>
<fname>O.</fname>
<surname>Fardel</surname>
</citauth>
<arttitle>Membrane transporter data to support kinetically-informed chemical risk assessment using non-animal methods: Scientific and regulatory perspectives</arttitle>
<title>Environ. Int.</title>
<year>2019</year>
<volumeno>126</volumeno>
<pages>
<fpage>659</fpage>
<lpage>671</lpage>
</pages>
</journalcit>
</citgroup>
<citgroup id="cit54">
<journalcit>
<citauth>
<surname>Oecd</surname>
</citauth>
<arttitle>Data from: EChemPortal: Global portal to information on chemical substances</arttitle>
<title>OECD Obs.</title>
</journalcit>
</citgroup>
</biblist>
<compoundgrp />
<annotationgrp />
<datagrp />
<resourcegrp />
</art-back>
<!--MAQ1: AQ: Please insert the expansion for the acronym ‘PGP’ if appropriate for the reader.-->
<!--MAQ2: CE: The sentence beginning ‘The PGP transporter is expressed.’ has been altered for clarity, please check that the meaning is correct.-->
<!--MAQ3: <AQ>The sentence beginning ‘The fraction unbound in plasma.’ has been altered for clarity, please check that the meaning is correct.</AQ>-->
<!--MAQ5: <AQ>In the sentence beginning ‘In pharmacology studies.’ a word or phrase appears to be missing after ‘is known and the remaining’. Please check this carefully and indicate any changes required here.</AQ>-->
</article>
CodePudding user response:
You will probably need to provide more information (i.e. sample code, a sample input file) to get an accurate answer.
Ultimately, the cause of the exception, if you are trying to encode a \u2018 char in ISO-8859-1 or a similar encoding, is that the character is not present in that encoding. ISO-8859-1 is an 8-bit encoding that does not contain most Unicode characters, including your character. You will need to encode it as a character entity reference: &#x2018;.
CodePudding user response:
Your problem is that you are trying to write Unicode characters not supported by the ASCII encoding inside an XML comment, specifically the left and right single quote marks inside this comment:
<!--MAQ1: AQ: Please insert the expansion for the acronym ‘PGP’ if appropriate for the reader.-->
Since these characters cannot be encoded into an XML comment, your XmlWriter throws the exception you see.
But why can't these characters be replaced by character entity fallbacks? As explained in the answer to the linked question Conversion of the special characters while adding it to the XML innertext in C#, the writer returned by XmlWriter.Create(stream, new XmlWriterSettings { Encoding = encoding }) will automatically replace Unicode characters in text content and attribute values not supported by the specified encoding with equivalent character entities. Thus if you write the XML <Root>‘</Root> using Encoding.ASCII, you will get <Root>&#x2018;</Root>:
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml("<Root>‘</Root>");
// Output to XML and escape all non-ASCII characters.
var xml = xmlDoc.GetOuterXml(encoding : Encoding.ASCII, omitXmlDeclaration : true);
Demo fiddle #1 here.
But what about unsupported characters in an XML comment? As explained by the XML Specification, comments are not actually part of the document's character data:
[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments...
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Furthermore, as can be seen from the formal grammar, comment text does not support character entity replacement. Thus XmlWriter cannot replace an unsupported character with anything equivalent, and throws an exception instead:
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml("<Root><!--‘--></Root>");
var xml = xmlDoc.GetOuterXml(encoding : Encoding.ASCII, omitXmlDeclaration : true); // Fails and throws an exception
Demo fiddle #2 here.
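To check whether comments are what trips the writer for a particular document, one option is to scan the comment nodes before saving. The following is a minimal sketch, not part of the original code: CommentEncodingDiagnostics and FindUnencodableComments are hypothetical names, and it assumes the document is already loaded into an XmlDocument. It clones the target encoding, switches the clone to an exception fallback, and collects every comment whose text cannot be encoded.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;

// Hypothetical diagnostic helper (an assumption for illustration only).
public static class CommentEncodingDiagnostics
{
    // Returns every comment whose text contains at least one character
    // that the given encoding cannot represent.
    public static List<XmlComment> FindUnencodableComments(XmlDocument doc, Encoding encoding)
    {
        // Encoding.Clone() returns a writable copy, so the fallback can be
        // changed without touching the shared Encoding.ASCII instance.
        var strict = (Encoding)encoding.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        var result = new List<XmlComment>();
        foreach (var comment in doc.SelectNodes("//comment()").OfType<XmlComment>())
        {
            try
            {
                // Throws if any character has no representation in the encoding.
                strict.GetBytes(comment.Value);
            }
            catch (EncoderFallbackException)
            {
                result.Add(comment);
            }
        }
        return result;
    }
}
```

Calling CommentEncodingDiagnostics.FindUnencodableComments(clsGlobal.XMLDoc, Encoding.ASCII) and printing each result's OuterXml should list the comments that need attention before the document is written.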
So, what are your possible workarounds?
Firstly, you could just strip all comments before writing. Comments are not actually part of the document content anyway and are generally ignored. To strip comments see How to remove all comment tags from XmlDocument.
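As a rough sketch of the first workaround (XmlCommentStripper and RemoveAllComments are hypothetical names, and this is only one of several ways to do it), the comment nodes can be selected with the XPath expression //comment() and detached from their parents before saving:

```csharp
using System.Linq;
using System.Xml;

// Hypothetical helper (an assumption for illustration only).
public static class XmlCommentStripper
{
    // Removes every comment node from the document in place
    // and returns the number of comments removed.
    public static int RemoveAllComments(XmlDocument doc)
    {
        // Copy the matches to a list first so that nodes are not removed
        // while the XPath result is still being enumerated.
        var comments = doc.SelectNodes("//comment()").Cast<XmlNode>().ToList();
        foreach (var comment in comments)
            comment.ParentNode.RemoveChild(comment);
        return comments.Count;
    }
}
```

Note that this modifies the document in memory, so if the comments are needed later (for instance the author queries in the sample XML), it may be safer to apply it to a clone obtained from doc.CloneNode(true).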
Secondly, you could create a custom XmlWriter decorator that replaces unsupported characters in comments with the fallback specified by the target encoding as they are being written. The following does this:
public static class XmlSerializationHelper
{
public static string GetOuterXml(this XmlNode node, bool indent = false, Encoding encoding = null, bool omitXmlDeclaration = false)
{
if (node == null)
return null;
using var stream = new MemoryStream();
node.Save(stream, indent : indent, encoding : encoding, omitXmlDeclaration : omitXmlDeclaration, closeOutput : false);
stream.Position = 0;
using var reader = new StreamReader(stream);
return reader.ReadToEnd();
}
public static void Save(this XmlNode node, Stream stream, bool indent = false, Encoding encoding = null, bool omitXmlDeclaration = false, bool closeOutput = true) =>
node.Save(stream, new XmlWriterSettings
{
Indent = indent,
Encoding = encoding ?? Encoding.UTF8,
OmitXmlDeclaration = omitXmlDeclaration,
CloseOutput = closeOutput,
});
public static void Save(this XmlNode node, Stream stream, XmlWriterSettings settings)
{
using var xmlWriter = XmlWriter.Create(stream, settings);
using var outerWriter = (settings?.Encoding != null && settings?.Encoding?.CodePage != Encoding.UTF8.CodePage) ? new TolerantCommentEncodingXmlWriter(xmlWriter, settings.Encoding) : null;
node.WriteTo(outerWriter ?? xmlWriter);
}
}
public class TolerantCommentEncodingXmlWriter : XmlWriterDecorator
{
Encoding CommentEncoding { get; }
public TolerantCommentEncodingXmlWriter(XmlWriter baseWriter, Encoding commentEncoding) : base(baseWriter) => this.CommentEncoding = commentEncoding;
public override void WriteComment(string text) =>
base.WriteComment(CommentEncoding?.GetString(CommentEncoding?.GetBytes(text)) ?? text);
}
public class XmlWriterDecorator : XmlWriter
{
// Taken from this answer https://stackoverflow.com/a/32150990/3744182
// by https://stackoverflow.com/users/3744182/dbc
// To https://stackoverflow.com/questions/32149676/custom-xmlwriter-to-skip-a-certain-element
// NOTE: async methods not implemented
readonly XmlWriter baseWriter;
public XmlWriterDecorator(XmlWriter baseWriter) => this.baseWriter = baseWriter ?? throw new ArgumentNullException();
protected virtual bool IsSuspended { get { return false; } }
public override WriteState WriteState => baseWriter.WriteState;
public override XmlWriterSettings Settings => baseWriter.Settings;
public override XmlSpace XmlSpace => baseWriter.XmlSpace;
public override string XmlLang => baseWriter.XmlLang;
public override void Close() => baseWriter.Close();
public override void Flush() => baseWriter.Flush();
public override string LookupPrefix(string ns) => baseWriter.LookupPrefix(ns);
public override void WriteBase64(byte[] buffer, int index, int count)
{
if (IsSuspended)
return;
baseWriter.WriteBase64(buffer, index, count);
}
public override void WriteCData(string text)
{
if (IsSuspended)
return;
baseWriter.WriteCData(text);
}
public override void WriteCharEntity(char ch)
{
if (IsSuspended)
return;
baseWriter.WriteCharEntity(ch);
}
public override void WriteChars(char[] buffer, int index, int count)
{
if (IsSuspended)
return;
baseWriter.WriteChars(buffer, index, count);
}
public override void WriteComment(string text)
{
if (IsSuspended)
return;
baseWriter.WriteComment(text);
}
public override void WriteDocType(string name, string pubid, string sysid, string subset)
{
if (IsSuspended)
return;
baseWriter.WriteDocType(name, pubid, sysid, subset);
}
public override void WriteEndAttribute()
{
if (IsSuspended)
return;
baseWriter.WriteEndAttribute();
}
public override void WriteEndDocument()
{
if (IsSuspended)
return;
baseWriter.WriteEndDocument();
}
public override void WriteEndElement()
{
if (IsSuspended)
return;
baseWriter.WriteEndElement();
}
public override void WriteEntityRef(string name)
{
if (IsSuspended)
return;
baseWriter.WriteEntityRef(name);
}
public override void WriteFullEndElement()
{
if (IsSuspended)
return;
baseWriter.WriteFullEndElement();
}
public override void WriteProcessingInstruction(string name, string text)
{
if (IsSuspended)
return;
baseWriter.WriteProcessingInstruction(name, text);
}
public override void WriteRaw(string data)
{
if (IsSuspended)
return;
baseWriter.WriteRaw(data);
}
public override void WriteRaw(char[] buffer, int index, int count)
{
if (IsSuspended)
return;
baseWriter.WriteRaw(buffer, index, count);
}
public override void WriteStartAttribute(string prefix, string localName, string ns)
{
if (IsSuspended)
return;
baseWriter.WriteStartAttribute(prefix, localName, ns);
}
public override void WriteStartDocument(bool standalone) => baseWriter.WriteStartDocument(standalone);
public override void WriteStartDocument() => baseWriter.WriteStartDocument();
public override void WriteStartElement(string prefix, string localName, string ns)
{
if (IsSuspended)
return;
baseWriter.WriteStartElement(prefix, localName, ns);
}
public override void WriteString(string text)
{
if (IsSuspended)
return;
baseWriter.WriteString(text);
}
public override void WriteSurrogateCharEntity(char lowChar, char highChar)
{
if (IsSuspended)
return;
baseWriter.WriteSurrogateCharEntity(lowChar, highChar);
}
public override void WriteWhitespace(string ws)
{
if (IsSuspended)
return;
baseWriter.WriteWhitespace(ws);
}
}
Then, for the XML <Root>‘<!--‘--></Root>, using Encoding.ASCII the ‘ inside the comment will be replaced with ?:
<Root>&#x2018;<!--?--></Root>
While for Encoding.Latin1 it will be replaced with ':
<Root>&#x2018;<!--'--></Root>
Demo fiddle #3 here. Demo fiddle #4 showing your original XML being written here.
Notice that Latin1 uses a slightly better fallback than ASCII. This is discussed in the documentation page How to use character encoding classes in .NET: Choosing a Fallback Strategy:
Best-Fit Fallback
When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character. (Best-fit fallback is mostly an encoding rather than a decoding issue. There are very few code pages that contain characters that cannot be successfully mapped to Unicode.) Best-fit fallback is the default for code page and double-byte character set encodings that are retrieved by the Encoding.GetEncoding(Int32) and Encoding.GetEncoding(String) overloads.
...
Replacement Fallback
When a character does not have an exact match in the target scheme, but there is no appropriate character that it can be mapped to, the application can specify a replacement character or string... It is also the default behavior of the ASCIIEncoding class, which replaces each character that it cannot encode or decode with a question mark.
But no matter which fallback you choose, if you write comment text containing characters not supported by your current encoding, the unsupported characters will be lost or remapped in some manner.
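If silently losing the characters is not acceptable, one possible variation is to remap them to a visible textual notation instead. The sketch below is an assumption rather than anything from the linked answers: CommentTextEscaper, EscapeUnencodable, and the [U+XXXX] notation are all hypothetical, and the notation has no meaning to XML parsers, so a reader of the file sees it as literal text. It keeps every character the encoding supports and rewrites the rest by code point.

```csharp
using System.Text;

// Hypothetical helper (an assumption for illustration only).
public static class CommentTextEscaper
{
    // Replaces each character the encoding cannot represent with a literal
    // "[U+XXXX]" marker. The marker is plain text, not an XML character reference.
    public static string EscapeUnencodable(string text, Encoding encoding)
    {
        // Writable clone with an exception fallback, used only to test encodability.
        var strict = (Encoding)encoding.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            // Treat a surrogate pair as a single unit.
            int len = char.IsSurrogatePair(text, i) ? 2 : 1;
            string element = text.Substring(i, len);
            try
            {
                strict.GetBytes(element);
                sb.Append(element);
            }
            catch (EncoderFallbackException)
            {
                int codePoint = len == 2 ? char.ConvertToUtf32(text, i) : text[i];
                sb.Append("[U+").Append(codePoint.ToString("X4")).Append(']');
            }
            i += len - 1;
        }
        return sb.ToString();
    }
}
```

For example, EscapeUnencodable("‘PGP’", Encoding.ASCII) returns "[U+2018]PGP[U+2019]". A WriteComment override in a decorator such as TolerantCommentEncodingXmlWriter could pass its text through a method like this before calling the base writer, although the per-character try/catch is slow and would be worth replacing for large documents.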