Java XSLT transformer with default namespace without xmlns

I'm working on some Java code that takes an XML DOM in which no namespace prefixes are declared, yet each element is in the http://www.w3.org/1999/xhtml namespace. (This is equivalent to the HTML DOM a browser gets.) The code uses the following to serialize the DOM to a string:

TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();

The resulting string looks like this:

…
<html xmlns="http://www.w3.org/1999/xhtml">
…

Note the presence of xmlns="http://www.w3.org/1999/xhtml", which the DOM did not have. In terms of XML, this is entirely correct: if an element is in a namespace (even without a prefix), the namespace must be declared on that element or on an ancestor element; and since this is the document element, the namespace declaration must go here.
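To reproduce this, here is a minimal sketch that builds a tiny DOM in the XHTML namespace without any xmlns attribute and serializes it with the default identity Transformer. The class name XhtmlSerializeDemo, the helper methods, and the two-element document are assumptions for illustration; only the TransformerFactory/Transformer calls come from the code above.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XhtmlSerializeDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Builds <html><body/></html> with both elements in the XHTML namespace.
    // No xmlns attribute is ever added to the DOM (hypothetical sample document).
    static Document buildDocument() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(XHTML_NS, "html");
            doc.appendChild(html);
            html.appendChild(doc.createElementNS(XHTML_NS, "body"));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the DOM with the default (identity) transformer, as in the question.
    static String serialize(Document doc) {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildDocument()));
    }
}
```

With the JDK's built-in transformer, the printed string should carry the xmlns declaration on the html element even though the DOM has no such attribute, which is the behavior described above.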

However, HTML is a slightly different story. The WHATWG HTML5 Specification § 2.1.3 XML compatibility says:

To ease migration from HTML to XML, user agents conforming to this specification will place elements in HTML in the http://www.w3.org/1999/xhtml namespace, at least for the purposes of the DOM and CSS.

In other words, HTML browsers will assume the http://www.w3.org/1999/xhtml namespace even without a namespace declaration. Typical clean HTML will not have a namespace declaration, and for this particular use case a namespace declaration is not required.

How can I tell a transformer not to add a default namespace declaration for the document? Alternatively, how can I remove it later without resorting to brute force such as regular expression matching?

Internally the com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl creates a com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO instance, which eventually calls com.sun.org.apache.xml.internal.serializer.ToStream.startPrefixMapping(String prefix, String uri, boolean shouldFlush). Here is the "offending" code that adds the xmlns="http://www.w3.org/1999/xhtml" on the document element:

if (EMPTYSTRING.equals(prefix))
{
  name = "xmlns";
  addAttributeAlways(XMLNS_URI, name, name, "CDATA", uri, false);
}

But to be more precise, I see that the actual adding of the attribute is done by com.sun.org.apache.xml.internal.serializer.AttributesImplSerializer.addAttribute(String uri, String local, String qname, String type, String val). This class extends org.xml.sax.helpers.AttributesImpl and implements org.xml.sax.Attributes.

Is there some way I can splice my own customized Attributes implementation into a Transformer, so that I can check this special case and forgo adding the xmlns="http://www.w3.org/1999/xhtml" attribute in the appropriate context?

I suppose as a last resort, is there a way to tell the Transformer to be namespace aware, but never to add xmlns declarations that weren't already in the DOM?

(For those who insist on asking where I got a DOM with an HTML namespace without an xmlns declaration, it's irrelevant. Let's assume that I constructed an XML DOM instance programmatically but want to output it as "clean" HTML5, so I remove the default xmlns attribute, but the Transformer is putting it back.)

CodePudding user response：

(Full disclosure: I'm actually fixing a bug in jsoup, which is an HTML parser that accepts dirty HTML as found in the wild and presents it to the application as a DOM, as a browser would. I fixed a bug where elements were not assigned the HTML namespace when no namespace declaration was present. Now the existing W3CDom.asString(Document doc) serializer method tries to add the xmlns namespace declaration, but users are accustomed to it returning an HTML serialization without the xmlns (which for HTML5 isn't wrong). So I'm trying to keep from breaking code that relies on the original "clean" HTML serialization without rewriting the serializer.)

The following is an ugly kludge, but given the constraints I don't see an alternative. I welcome a better approach!

/**
 * Pattern to detect the <code>xmlns="http://www.w3.org/1999/xhtml"</code> default namespace
 * declaration when serializing the DOM to HTML. This pattern is "good enough", relying in part
 * on the output of the {@link Transformer} used in the implementation, but is not a complete
 * solution for all the serializations possible; that is, if one constructed an XML string
 * manually, it might be possible to find an obscure variation that this pattern would not
 * match.
 */
static final Pattern HTML_DEFAULT_NAMESPACE_PATTERN =
    Pattern.compile("<html[^>]*(\\sxmlns=['\"]http://www.w3.org/1999/xhtml['\"])");

/**
 * Removes the default <code>xmlns="http://www.w3.org/1999/xhtml"</code> HTML namespace
 * declaration if present in the string.
 * 
 * @param html The serialized HTML.
 * @return A string without the default <code>xmlns="http://www.w3.org/1999/xhtml"</code> HTML
 *         namespace declaration.
 * @see <a href="https://github.com/jhy/jsoup/issues/1837">Issue #1837: Bug: DOM elements not
 *      being placed in (X)HTML namespace.</a>
 */
static String removeDefaultHtmlNamespaceDeclaration(String html) {
    Matcher matcher = HTML_DEFAULT_NAMESPACE_PATTERN.matcher(html);
    if (matcher.find()) {
        html = html.substring(0, matcher.start(1)) + html.substring(matcher.end(1));
    }
    return html;
}
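For reference, a small self-contained usage sketch of this approach. The class name NamespaceStripDemo and the sample inputs are assumptions for illustration; the pattern and method body follow the code above, with the two substrings joined by string concatenation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamespaceStripDemo {
    // Same pattern as above: captures the default XHTML namespace declaration
    // (with its leading whitespace) inside the <html ...> start tag.
    static final Pattern HTML_DEFAULT_NAMESPACE_PATTERN =
        Pattern.compile("<html[^>]*(\\sxmlns=['\"]http://www.w3.org/1999/xhtml['\"])");

    // Cuts the captured declaration out of the string; other input is returned unchanged.
    static String removeDefaultHtmlNamespaceDeclaration(String html) {
        Matcher matcher = HTML_DEFAULT_NAMESPACE_PATTERN.matcher(html);
        if (matcher.find()) {
            html = html.substring(0, matcher.start(1)) + html.substring(matcher.end(1));
        }
        return html;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs, not taken from jsoup's test suite.
        String withNs = "<html lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\"><head></head></html>";
        String withoutNs = "<html><head></head></html>";
        System.out.println(removeDefaultHtmlNamespaceDeclaration(withNs));
        System.out.println(removeDefaultHtmlNamespaceDeclaration(withoutNs));
    }
}
```

Because only the first match is removed and the pattern is tied to the html start tag, namespace declarations elsewhere in the document (for example on embedded SVG or MathML elements) are left alone.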

CodePudding user response：

It looks to me as if the DOM was created by an application that put the nodes in the XHTML namespace, and therefore the serializer is entirely correct to serialize them in that namespace. From your description, the application did that because it was parsing HTML5 and that's what the HTML5 specification says it should do.

Part of the problem is that you're using an XSLT 1.0 serializer, and XSLT 1.0 predates XHTML and certainly predates HTML5. Unfortunately, just because W3C or WHATWG issues a proclamation doesn't mean that everyone changes their software. You may have better luck using an XSLT 3.0 serializer (Saxon) with the HTML5 output method, but I don't know what your project constraints are.