Parse and save XML without replacing > with &gt;

Time:06-23

I have to parse, modify and save an XML document that contains > in an attribute value.

Contrary to popular belief, it's perfectly fine for this character to NOT be replaced with &gt;, as described in the standard:

The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

(2.4 Character Data and Markup)

I cannot allow the parser to modify the attribute values since existing code relies on the current form (also it would make the XML rather unwieldy).

A sample would be:

<?xml version="1.0" encoding="utf-8"?>
<Foo Name="a->b">
</Foo>

Neither XmlDocument nor XDocument can load and save this document without changing the a->b to a-&gt;b.
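
A minimal repro of the behaviour described above, assuming a current .NET runtime. The class and method names (RoundTripRepro, ViaXDocument, ViaXmlDocument) are made up for illustration:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class RoundTripRepro
{
    // The sample document, without the XML declaration.
    public const string Input = "<Foo Name=\"a->b\"></Foo>";

    // Load with XDocument and serialize again with the default writer.
    public static string ViaXDocument() => XDocument.Parse(Input).ToString();

    // The same round trip through the older XmlDocument API.
    public static string ViaXmlDocument()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Input);
        return doc.OuterXml;
    }
}
```

Both methods should return markup in which the attribute reads Name="a-&gt;b" rather than Name="a->b".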

Is there any way to work around this? I could fix the data in a post-processing step, but there are situations where > must be escaped so this seems rather error-prone.
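
To illustrate why a blanket post-processing step is risky, here is a small sketch. NaivePostProcessing and its members are hypothetical helpers, not part of any library, and the sample string is invented for the demonstration:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class NaivePostProcessing
{
    // Saved output in which one &gt; is optional (attribute) and one is required ("]]>" in content).
    public const string Saved = "<Foo Name=\"a-&gt;b\">x]]&gt;y</Foo>";

    // Hypothetical fix-up: blindly turn every &gt; back into >.
    public static string Unescape(string xml) => xml.Replace("&gt;", ">");

    // True if the string still parses as well-formed XML.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Here IsWellFormed(Saved) is true, but IsWellFormed(Unescape(Saved)) is false, because the replacement also produces a bare "]]>" inside element content, which the parser rejects.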

CodePudding user response:

XDocument (and more generally XmlReader) will load XML without converting > characters to &gt; (In fact just the opposite happens -- &gt; will be unescaped to > by XmlReader). You may verify that by doing:

var xmlString = @"<?xml version=""1.0"" encoding=""utf-8""?><Foo Name=""a->b""></Foo>";
var doc = XDocument.Parse(xmlString);
Assert.AreEqual("a->b", doc.Root.Attribute("Name").Value); // Passes successfully

Demo fiddle #1 here.

Instead what you are seeing is that, when writing your XDocument back to XML, XmlWriter unconditionally escapes > as &gt; even when not strictly necessary. (An XmlWriter is always used to format an XNode to XML, either explicitly when you construct it yourself to write to some Stream or TextWriter, or internally by XNode.ToString().)
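
The escaping can be observed without any XDocument involved by driving an XmlWriter directly. This is only a sketch; WriterEscapingDemo and WriteFoo are illustrative names:

```csharp
using System.IO;
using System.Xml;

public static class WriterEscapingDemo
{
    // Writes a single <Foo> element with the given Name attribute and returns the markup.
    public static string WriteFoo(string name)
    {
        var textWriter = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var writer = XmlWriter.Create(textWriter, settings))
        {
            writer.WriteStartElement("Foo");
            // WriteAttributeString passes the value through WriteString, which escapes '>'.
            writer.WriteAttributeString("Name", name);
            writer.WriteEndElement();
        }
        return textWriter.ToString();
    }
}
```

Calling WriterEscapingDemo.WriteFoo("a->b") should give an attribute of the form Name="a-&gt;b", which shows that the writer is responsible for the escaping, not the parser.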

If you don't want this, you will have to subclass XmlWriter and modify the logic of XmlWriter.WriteString(String) to use your preferred escaping. However XmlWriter itself is abstract; the XmlWriter returned by XmlWriter.Create() is some internal concrete subclass which cannot be subclassed directly. Thus you will need to use the decorator pattern to wrap the writer returned by XmlWriter.Create():

using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public class NoEndBracketEscapingXmlWriter : XmlWriterDecorator
{
    public NoEndBracketEscapingXmlWriter(XmlWriter baseWriter) : base(baseWriter) { }

    public override void WriteString(string text)
    {
        // Per the XML standard (2.4 Character Data and Markup), '>' only has to be escaped when it
        // ends the sequence "]]>" in content and is not marking the end of a CDATA section.
        // Every other '>' is written raw so that the base writer does not turn it into &gt;.
        if (string.IsNullOrEmpty(text))
        {
            base.WriteString(text);
            return;
        }
        int prevIndex = 0, index;
        char[] buffer = null;
        while ((index = text.IndexOf('>', prevIndex)) >= 0)
        {
            if (buffer == null)
                buffer = text.ToCharArray();
            if (index >= 2 && text[index - 1] == ']' && text[index - 2] == ']')
            {
                // This '>' closes "]]>": keep it in the chunk so the base writer escapes it.
                base.WriteChars(buffer, prevIndex, index - prevIndex + 1);
            }
            else
            {
                // Write the text before the '>' normally, then the '>' itself unescaped.
                base.WriteChars(buffer, prevIndex, index - prevIndex);
                base.WriteRaw(">");
            }
            prevIndex = index + 1;
        }

        if (buffer == null)
            base.WriteString(text); // no '>' at all, nothing special to do
        else if (prevIndex < buffer.Length)
            base.WriteChars(buffer, prevIndex, buffer.Length - prevIndex);
    }
}

public class XmlWriterDecorator : XmlWriter
{
    // Taken from this answer https://stackoverflow.com/a/32150990/3744182
    // by https://stackoverflow.com/users/3744182/dbc
    // To https://stackoverflow.com/questions/32149676/custom-xmlwriter-to-skip-a-certain-element
    // NOTE: async methods not implemented
    readonly XmlWriter baseWriter;

    public XmlWriterDecorator(XmlWriter baseWriter) => this.baseWriter = baseWriter ?? throw new ArgumentNullException(nameof(baseWriter));

    protected virtual bool IsSuspended { get { return false; } }

    public override void Close() => baseWriter.Close();

    public override void Flush() => baseWriter.Flush();

    public override string LookupPrefix(string ns) => baseWriter.LookupPrefix(ns);

    public override void WriteBase64(byte[] buffer, int index, int count)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteBase64(buffer, index, count);
    }

    public override void WriteCData(string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteCData(text);
    }

    public override void WriteCharEntity(char ch)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteCharEntity(ch);
    }

    public override void WriteChars(char[] buffer, int index, int count)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteChars(buffer, index, count);
    }

    public override void WriteComment(string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteComment(text);
    }

    public override void WriteDocType(string name, string pubid, string sysid, string subset)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteDocType(name, pubid, sysid, subset);
    }

    public override void WriteEndAttribute()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEndAttribute();
    }

    public override void WriteEndDocument()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEndDocument();
    }

    public override void WriteEndElement()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEndElement();
    }

    public override void WriteEntityRef(string name)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEntityRef(name);
    }

    public override void WriteFullEndElement()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteFullEndElement();
    }

    public override void WriteProcessingInstruction(string name, string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteProcessingInstruction(name, text);
    }

    public override void WriteRaw(string data)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteRaw(data);
    }

    public override void WriteRaw(char[] buffer, int index, int count)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteRaw(buffer, index, count);
    }

    public override void WriteStartAttribute(string prefix, string localName, string ns)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteStartAttribute(prefix, localName, ns);
    }

    public override void WriteStartDocument(bool standalone) => baseWriter.WriteStartDocument(standalone);

    public override void WriteStartDocument() => baseWriter.WriteStartDocument();

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteStartElement(prefix, localName, ns);
    }

    public override WriteState WriteState => baseWriter.WriteState;

    public override void WriteString(string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteString(text);
    }

    public override void WriteSurrogateCharEntity(char lowChar, char highChar)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteSurrogateCharEntity(lowChar, highChar);
    }

    public override void WriteWhitespace(string ws)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteWhitespace(ws);
    }
}   

And then you could use it e.g. in the following extension method:

public static class XNodeExtensions
{
    public static string ToStringNoEndBracketEscaping(this XNode node)
    {
        if (node == null)
            throw new ArgumentNullException(nameof(node));
        using var textWriter = new StringWriter();
        using (var innerWriter = XmlWriter.Create(textWriter, new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true }))
        using (var writer = new NoEndBracketEscapingXmlWriter(innerWriter))
        {
            node.WriteTo(writer);
        }
        return textWriter.ToString();
    }
}

And now if you do

var newXml = doc.ToStringNoEndBracketEscaping();

The result will be

<Foo Name="a->b"></Foo>

Demo fiddle #2 here.
