How to parse and modify XHTML in Node.js (supporting HTML entities and CDATA sections)?

I am developing a Node.js app that receives an XHTML snippet (Confluence storage format), makes some modifications to it, and then sends back the modified XHTML. The XHTML may contain HTML entities (such as &ouml;) and CDATA sections (such as <![CDATA[test]]>).

The challenge that I’m running into is that with the parsers that I have tried, when I parse the snippet in HTML mode, the CDATA sections break, but when I parse it in XML mode, the HTML entities are not interpreted correctly.

Below is an example of how I got this to work in the browser, and of how I failed to get it to work using jsdom and cheerio. Is there another library that I could use to achieve this, or a different way to use jsdom or cheerio?

In the browser

In the browser, I can work with DOMParser in XML mode. Working with the test snippet <span>&ouml;<![CDATA[ä]]></span>, I can wrap it in an XHTML body:

const doc = new DOMParser().parseFromString(`<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`, 'application/xml');
doc.querySelector('body').innerHTML;   // <span>ö<![CDATA[ä]]></span>
doc.querySelector('body').textContent; // öä

The XML MIME type ensures that the CDATA section is interpreted correctly, while the XHTML DOCTYPE makes sure that the entities are supported.
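The wrapping step can be factored into a small helper. The following is only a minimal sketch: the names XHTML_DOCTYPE, wrapXhtmlSnippet and parseXhtmlSnippet are made up for illustration, and parseXhtmlSnippet assumes a browser environment where DOMParser is available.

```javascript
// Same DOCTYPE as in the example above; it is what makes named entities available.
const XHTML_DOCTYPE =
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ' +
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';

// Wrap a bare snippet so that it forms a complete XHTML document (hypothetical helper).
function wrapXhtmlSnippet(snippet) {
  return `${XHTML_DOCTYPE}<html><body>${snippet}</body></html>`;
}

// Browser only: parse the wrapped snippet in XML mode and return its <body> element.
function parseXhtmlSnippet(snippet) {
  const doc = new DOMParser().parseFromString(wrapXhtmlSnippet(snippet), 'application/xml');
  return doc.querySelector('body');
}
```

Calling parseXhtmlSnippet with the test snippet and reading innerHTML or textContent from the result should then behave like the inline example above.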

jsdom

To achieve the same in Node.js, I attempted to use jsdom. The problem is that when I parse the code in HTML mode, the CDATA section gets converted into a comment, but when I parse it in XML mode, an exception is thrown because of the HTML entity:

import { JSDOM } from 'jsdom';
const xhtml = `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`;

new JSDOM(xhtml).window.document.body.innerHTML; // <span>ö<!--[CDATA[ä]]--></span>
new JSDOM(xhtml).window.document.body.textContent; // ö
new JSDOM(xhtml, { contentType: 'application/xml' }); // Uncaught DOMException [SyntaxError]: about:blank:1:186: undefined entity.

Update: I have reported the problem to jsdom.
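One possible way around the "undefined entity" error, offered here only as a sketch that I have not verified against jsdom, is to rewrite named HTML entities as numeric character references before handing the string to an XML parser, since numeric references are valid in any XML document. The function namedToNumericEntities and the ENTITY_CODE_POINTS table below are hypothetical names, and the table is deliberately incomplete; a real one would need to cover every entity that the XHTML DTD defines. CDATA sections are skipped so that their content stays literal.

```javascript
// Hypothetical, deliberately incomplete table of named HTML entities.
const ENTITY_CODE_POINTS = new Map([
  ['ouml', 0xf6],
  ['auml', 0xe4],
  ['uuml', 0xfc],
  ['szlig', 0xdf],
  ['nbsp', 0xa0],
]);

// Entities that XML itself predefines; these must be left alone.
const XML_PREDEFINED = new Set(['amp', 'lt', 'gt', 'quot', 'apos']);

function namedToNumericEntities(xhtml) {
  // Splitting on a capturing group keeps the CDATA sections in the result,
  // at the odd indexes of the array.
  return xhtml
    .split(/(<!\[CDATA\[[\s\S]*?\]\]>)/)
    .map((part, index) => {
      if (index % 2 === 1) return part; // CDATA section: leave untouched
      return part.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
        if (XML_PREDEFINED.has(name)) return match;
        const codePoint = ENTITY_CODE_POINTS.get(name);
        // Unknown names are left as-is so that the parser still reports them.
        return codePoint === undefined ? match : `&#${codePoint};`;
      });
    })
    .join('');
}

console.log(namedToNumericEntities('<span>&ouml;<![CDATA[&ouml;]]></span>'));
// <span>&#246;<![CDATA[&ouml;]]></span>
```

The drawback of this approach is that the serialized output no longer contains the original named entities, which may or may not matter to the consumer of the XHTML.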

cheerio

My preferred method to do DOM modifications in the backend would be cheerio. Using cheerio in HTML mode, the CDATA section gets converted into a comment. In XML mode, the entity is not interpreted but rather double-escaped into &amp;ouml;. In XML mode without decoding entities, the XHTML is preserved correctly, but the entities are still not interpreted, which can be seen when getting the text content.

import cheerio from 'cheerio';
const xhtml = `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`;

cheerio.load(xhtml).root().find('body').html(); // <span>ö<!--[CDATA[ä]]--></span>
cheerio.load(xhtml).root().find('body').text(); // ö
cheerio.load(xhtml, { xmlMode: true }).root().find('body').html(); // <span>&amp;ouml;<![CDATA[ä]]></span>
cheerio.load(xhtml, { xmlMode: true }).root().find('body').text(); // &ouml;ä
cheerio.load(xhtml, { xmlMode: true, decodeEntities: false }).root().find('body').html(); // <span>&ouml;<![CDATA[ä]]></span>
cheerio.load(xhtml, { xmlMode: true, decodeEntities: false }).root().find('body').text(); // &ouml;ä

Update: I have reported the problem to cheerio.

Answer:

I was pointed to a workaround for the issue in cheerio:

cheerio.load(xhtml, { xml: { xmlMode: false, recognizeCDATA: true, recognizeSelfClosing: true } });

With these options, I can successfully parse XHTML in a Node.js environment.

In addition to this solution, I noticed that using DOMParser in the browser has the disadvantage of inconsistencies between browsers. In particular, when using query selectors in combination with XML namespaces, I sometimes had to include the namespace in the query and sometimes not. Because of these inconsistencies, jQuery also officially doesn't support XML namespaces. To achieve consistent behaviour across browsers, and also between the frontend, the frontend tests and the backend, I decided to use cheerio even for parsing XHTML in the browser.