How to escape all HTML in a string except <a>?

I am making a chat app, and I want to escape all HTML sent to the event (except <a> tags, because links are auto-converted to HTML).

This is my escape function:

const escapeHtml = (unsafe) => {
    return unsafe.replaceAll('<', '&lt;').replaceAll('>', '&gt;');
};

and this is the element's HTML (that is shown to the user/client):

message.innerHTML = escapeHtml(json.username) + ":<br/>" + escapeHtml(json.message);

CodePudding user response：

Note: I know this will receive some hate from future readers who came with the same question, but hear me out:

You're probably better off not doing that.

Escaping HTML correctly is already hard in itself. HTML is a very complex language with many edge cases and is far from being "regular".
There are good tools and great solutions for escaping HTML, but they only work when escaping the whole input. Trying to modify one of these solutions so that it handles your special case will inevitably bring security risks. For instance, at the time I write this answer, two other answers have been posted, and I could perform an XSS attack on both of them in less than a few minutes. I'm not even a security researcher, just someone who knows HTML.

So, tweaking a sanitizer isn't the way to go, nor is writing one yourself. Then what?

What you want is to let your users write some text and include links in it. You don't need HTML to do that. You can use a totally different markup language that defines which sequence should be treated as a link, but that doesn't understand any HTML and, moreover, that no HTML parser will mistake for HTML.

For instance CommonMark, which is used on this very website, on GitHub and in many other places, does just that. It defines that the sequence [word](https://example.com) creates an anchor with the text "word" pointing to https://example.com, and we can store this while escaping any HTML we want without any risk. Even better, you don't need to escape the content at all, because you can now entirely avoid dangerous methods like setting .innerHTML and stick to the safe .textContent.
But you shouldn't even need to worry about that, because there are many well-written CommonMark parsers and renderers that will generate just what you need directly.
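
To make the idea concrete, here is a minimal sketch of the "links without HTML" approach. It is not a CommonMark parser: it only recognises the [label](http(s)://url) form, and the names parseMessage and renderMessage are made up for this illustration. A real, well-tested parser is still the better choice.

```javascript
// Hypothetical helper: split a message into plain-text and link segments.
// Only [label](http://...) and [label](https://...) are treated as links;
// everything else, including any HTML, stays plain text.
function parseMessage(input) {
    const linkPattern = /\[([^\]\n]+)\]\((https?:\/\/[^\s)]+)\)/g;
    const segments = [];
    let lastIndex = 0;
    for (const match of input.matchAll(linkPattern)) {
        if (match.index > lastIndex) {
            segments.push({ type: 'text', value: input.slice(lastIndex, match.index) });
        }
        segments.push({ type: 'link', label: match[1], href: match[2] });
        lastIndex = match.index + match[0].length;
    }
    if (lastIndex < input.length) {
        segments.push({ type: 'text', value: input.slice(lastIndex) });
    }
    return segments;
}

// Hypothetical renderer (browser only): builds DOM nodes and never sets .innerHTML,
// so user input is only ever inserted as text or as an http(s) href.
function renderMessage(container, input) {
    for (const segment of parseMessage(input)) {
        if (segment.type === 'link') {
            const anchor = document.createElement('a');
            anchor.href = segment.href;
            anchor.textContent = segment.label;
            container.append(anchor);
        } else {
            container.append(document.createTextNode(segment.value));
        }
    }
}
```

With this, renderMessage(message, json.message) would show any <b> or <script> typed by a user as literal text, and a javascript: URL is never turned into a link because the pattern only accepts http and https.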

CodePudding user response：

The .replaceAll() function accepts a function as the replacement argument if you need to do more advanced processing. You can use that function to decide if you want to do the replacement or not:

unsafe.replaceAll(/<([^\s>]+)(.*?)>/g, (match, tag, remainder) => {
    // Keep <a ...> and </a> as they are; escape every other tag
    if (tag === 'a' || tag === '/a') {
        return `<${tag}${remainder}>`;
    }
    else {
        return `&lt;${tag}${remainder}&gt;`;
    }
});

You can read about how to use a function as the replacement in the docs: Specifying functions as the replacement.
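
As a quick check of what the callback does, here is a self-contained sketch. The wrapper name escapeExceptAnchors is made up for this example, and the pattern is written with a greedy + after the character class so that the whole tag name is captured.

```javascript
// Hypothetical wrapper around the callback-based replacement.
const escapeExceptAnchors = (unsafe) =>
    unsafe.replaceAll(/<([^\s>]+)(.*?)>/g, (match, tag, remainder) =>
        tag === 'a' || tag === '/a'
            ? `<${tag}${remainder}>`        // anchor tags are returned unchanged
            : `&lt;${tag}${remainder}&gt;`  // every other tag is escaped
    );

console.log(escapeExceptAnchors('<b>hi</b> <a href="/x">link</a>'));
// &lt;b&gt;hi&lt;/b&gt; <a href="/x">link</a>

// Caveat: attributes on <a> are not inspected, so this comes back untouched.
console.log(escapeExceptAnchors('<a href="#" onclick="alert(1)">click</a>'));
// <a href="#" onclick="alert(1)">click</a>
```

The second call shows why this approach alone is not a safe sanitizer: an <a> tag can still carry event-handler attributes or a javascript: URL.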

Explanation of the regular expression:

The regexp is just looking for all tags enclosed by < and >:

<        // starts with <
(        // remember this matching group (function 2nd argument)
  [^     // anything that is not
    \s   // whitespace
    >    // or >
  ]
  +      // one or more of the above
)
(        // remember this matching group (function 3rd argument)
  .      // any character
  *      // zero or more of the above
  ?      // don't be greedy
)
>        // ends with >

Note that this regular expression expects every HTML tag to start with the tag name immediately after the < (e.g. <a>). It breaks if there is whitespace before the tag name, for example:

// the regexp above does not work if your string looks like this:

hello <
         a href="/world"
      > world </a>

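To fix that you can add a zero-or-more whitespace pattern (\s*) right after <. Because . does not match line breaks by default, a tag spread over several lines like the one above also needs the s (dotAll) flag:

/<\s*([^\s>]+)(.*?)>/gs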
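
The multi-line case can be checked with a short sketch. It assumes the + quantifier after the character class and adds the s (dotAll) flag so that . also matches line breaks. The names tagPattern and keepAnchors are made up for this example.

```javascript
// Assumption: s (dotAll) is added so that . can cross the line breaks inside a tag.
const tagPattern = /<\s*([^\s>]+)(.*?)>/gs;

// Hypothetical callback: keep anchor tags, escape everything else.
const keepAnchors = (match, tag, remainder) =>
    tag === 'a' || tag === '/a'
        ? `<${tag}${remainder}>`
        : `&lt;${tag}${remainder}&gt;`;

const input = 'hello <\n         a href="/world"\n      > world </a>';
const output = input.replaceAll(tagPattern, keepAnchors);

// The whitespace between < and the tag name is dropped; the rest is kept.
console.log(output === 'hello <a href="/world"\n      > world </a>'); // true

// A non-anchor tag split over several lines is now escaped as well.
console.log('<\n script\n>'.replaceAll(tagPattern, keepAnchors) === '&lt;script\n&gt;'); // true
```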