How can I remove all HTML tags in a string except for the 's' tag?

I'm trying to remove all HTML tags except <s></s> tags. Right now I have:

contents.replace(/(<([^>]+)>)/gi, '')

This removes all HTML tags.

So I tried several other patterns:

<\/?(?!s)\w*\b[^>]*>
<(?!s|/s).*?>
...

However, these regexes mishandle every tag whose name starts with the letter 's', not just <s>, because the lookahead only checks the first letter.

For example, <strong>, <span>, and so on.

I'd really appreciate it if you could help me.

CodePudding user response：

Whether or not this is possible depends on how accurate you want to be. Regex cannot be used to 100% accurately parse HTML.

But if you just want something quick and dirty:

You can take advantage of the fact that String.prototype.replace accepts a function as the replacement, and that function receives each capture group as a separate argument: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement

So you can make two capture groups:

Group 1 (<s> or </s>): (<\/?s>)

Group 2 (starts with <, ends with >, and has no > in between): (<[^>]*>)

Then, in the function passed to string.replace, return the match if group 1 matched; otherwise only group 2 matched, so return an empty string:

function removeTags(text) {
  const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
  return text.replace(regex, (_, g1) => g1 || '');
}

let text = '<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>';
console.log(removeTags(text)); // Span Text <s>S Text Strong Text</s>

Note the flaw: if < and > exist as text, everything in between may be considered a tag when it is not:

function removeTags(text) {
  const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
  return text.replace(regex, (_, g1) => g1 || '');
}

let text = '<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>';
console.log("But the regex fails:");
console.log(removeTags(text)); // " This is how you start a tag: ``"
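
One way to narrow this flaw while staying with a regex is to require a tag name (a letter) right after < or </, so that a lone < followed by punctuation is left alone. This is only a sketch: the name removeTagsStrict is made up for illustration, it still keeps only bare <s> and </s> (an <s> with attributes would be stripped), and it is still not a parser.

```javascript
// Sketch: strip tags except <s> and </s>, but only treat "<" as the start of
// a tag when it is followed by a letter (optionally after "/").
function removeTagsStrict(text) {
  // (?!s>) refuses exactly "s>", so <s> and </s> are never matched.
  // [a-z] demands a tag name, so "<" followed by punctuation is not a tag.
  const regex = /<\/?(?!s>)[a-z][^>]*>/gi;
  return text.replace(regex, '');
}

const sample = '<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>';
console.log(removeTagsStrict(sample));
// logs: " This is how you start a tag: `<` and this is how you end a tag: `>`"
```

Text such as "1 <b and 2> 3" would still be treated as a tag, and comments like <!-- ... --> are left in, so the warning about regex accuracy still applies.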

An HTML parser can see that these brackets do not create a tag:

<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>

If you want accurate parsing, use an HTML parser.
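
A minimal sketch of the parser route, assuming the code runs in a browser where DOMParser is available (it is not built into plain Node.js). The function names keepOnlyS and removeTagsWithParser are hypothetical. The walker only relies on the standard DOM node properties nodeType, nodeName, childNodes, and textContent.

```javascript
// Walk an already-parsed DOM tree and rebuild a string that keeps only <s>
// elements. Attributes on <s> are dropped, and text is copied as-is, so
// escape it again if the result is going back into HTML.
function keepOnlyS(node) {
  let out = '';
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      // 3 is Node.TEXT_NODE: keep the text
      out += child.textContent;
    } else if (child.nodeType === 1) {
      // 1 is Node.ELEMENT_NODE: keep the wrapper only for <s>
      const inner = keepOnlyS(child);
      out += child.nodeName.toLowerCase() === 's' ? `<s>${inner}</s>` : inner;
    }
    // other node types (comments and so on) are dropped
  }
  return out;
}

// Browser-only entry point (assumption: a global DOMParser exists).
function removeTagsWithParser(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return keepOnlyS(doc.body);
}
```

In a browser, calling removeTagsWithParser(text) on the strings from the snippets above is how this would be used; the parser, not a pattern, decides what counts as a tag.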

CodePudding user response：

You could try: /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi

The first part, (<([^>s]+)>), matches every tag that does not contain the letter s anywhere between < and >.

The second part, (<\/?(\w{2,})>), matches every attribute-free tag whose name has two or more characters, which covers names such as span and strong but not s.

Demo: https://regex101.com/r/AFlXam/1
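
A quick check of this pattern, assuming each capture group is quantified with + as written above. The helper name stripExceptS is made up for illustration. The second call shows a limitation worth knowing about: an opening tag whose attributes contain the letter s matches neither alternative and is left in place.

```javascript
// Sketch: apply the two-alternative pattern and replace every match with ''.
function stripExceptS(text) {
  const regex = /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi;
  return text.replace(regex, '');
}

console.log(stripExceptS('<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>'));
// logs: Span Text <s>S Text Strong Text</s>

console.log(stripExceptS('<div class="x">hi</div>'));
// logs: <div class="x">hi
```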