Why must < be escaped in an XML attribute?

Time:11-07

I wonder a bit why < must be escaped in an XML attribute, e.g.

<foo bar="3 < 4" />

From the surrounding context (inside a tag, inside an attribute value), it should be quite clear to a parser that it can't be the beginning of a new tag.

What is the reason the XML specification prohibits this?

CodePudding user response:

I don't know precisely, but in many cases the explanation is SGML compatibility. XML was designed to be a subset of SGML, and therefore didn't allow things that SGML didn't allow.

CodePudding user response:

A less-than character (<) must indeed be escaped within attribute values:

Well-Formedness Constraint: No < in Attribute Values

The replacement text of any entity referred to directly or indirectly in an attribute value (other than "&lt;") must not contain a <.
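
A quick way to see the constraint in action is to hand both forms to a conforming parser. The following is only a minimal sketch, assuming Python's standard-library xml.etree.ElementTree (which is backed by expat) and xml.sax.saxutils.quoteattr; the helper name try_parse is made up for illustration and is not from the spec or the answer.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def try_parse(doc):
    """Return the value of the 'bar' attribute, or the parser error as text."""
    try:
        return ET.fromstring(doc).get("bar")
    except ET.ParseError as err:
        return f"ParseError: {err}"

raw = '<foo bar="3 < 4" />'          # literal < inside the attribute value
escaped = '<foo bar="3 &lt; 4" />'   # same value with < written as &lt;

print(try_parse(raw))       # rejected: the document is not well-formed
print(try_parse(escaped))   # accepted: the attribute value is "3 < 4"

# quoteattr() escapes &, < and > and adds the surrounding quotes,
# so a generated document never contains a raw < in an attribute.
generated = f"<foo bar={quoteattr('3 < 4')} />"
print(generated)
```

When writing XML by hand or from code, letting a library routine such as quoteattr do the escaping is usually safer than replacing < yourself.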

Why?

As you observe, attribute values containing < can be unambiguously parsed. However, the motivation was to make XML's parsing rules as simple as possible...

According to Tim Bray, one of the XML 1.0 W3C Recommendation editors and author of The Annotated XML Specification, which captures some of the rationale behind XML design decisions:

Banishing the <

This rule might seem a bit unnecessary, on the face of it. Since you can't have tags in attribute values, having an < can hardly be confusing, so why ban it?

This is another attempt to make life easy for the DPH. The rule in XML is simple: when you're reading text, and you hit a <, then that's a markup delimiter. Not just sometimes, always. When you want one in the data, you have to use &lt;. Not just sometimes, always. In attribute values too.

This rule has another unintended beneficial side-effect; it makes the catching of certain errors much easier. Suppose you have a chunk of XML as follows:

<a href="notes.html> <img src='notes.gif'></a>

Notice that the notes.html is missing its closing quote. Without the no-< rule, it would be really hard to detect this problem and issue a reasonable error message. Since attribute values can contain almost anything, no error would be detected until the processor finds the next quotation mark. Instead, you get an error message the first time you hit a <, which in the example above, as in many cases, is almost immediately.
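
The error-catching effect described in the quote can be tried out directly. This is a rough sketch, again assuming Python's xml.etree.ElementTree; the function name first_error is hypothetical, and the exact error message and column depend on the underlying expat parser.

```python
import xml.etree.ElementTree as ET

# Tim Bray's example: the closing quote after notes.html is missing.
broken = """<a href="notes.html> <img src='notes.gif'></a>"""

def first_error(doc):
    """Return (line, column) of the first well-formedness error, or None."""
    try:
        ET.fromstring(doc)
    except ET.ParseError as err:
        return err.position
    return None

position = first_error(broken)
print("error at (line, column):", position)
print("next quotation mark is at column:", broken.index("'"))
```

The reported column falls before the next quotation mark, so the parser complains near the actual mistake instead of much later in the document.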
