How can I remove all HTML tags in a string except for the only 's' tag?

Time:01-27

I'm trying to remove all HTML tags except <s></s> tags. Right now I have:

contents.replace(/(<([^>]+)>)/gi, '')

This removes all HTML tags.

So I tried several other patterns:

<\/?(?!s)\w*\b[^>]*>
<(?!s|/s).*?>

However, these regexes also skip every tag whose name starts with the letter 's', so those tags are left in place.

For example, <strong>, <span>, and so on.

I'd really appreciate it if you could help me.

CodePudding user response:

Whether or not this is possible depends on how accurate you want to be. Regex cannot be used to 100% accurately parse HTML.

But if you just want something quick and dirty:

You can take advantage of the fact that String.prototype.replace allows you to differentiate between capture groups: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement

So you can make two capture groups:

Group 1 ("<s> or </s>"): (<\/?s>)

Group 2 ("starts with <, ends with >, and has no > in between"): (<[^>]*>)

Then, in the replacement function passed to String.prototype.replace, return the match when group 1 matched. Otherwise only group 2 matched, so return an empty string:

function removeTags(text) {
  const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
  return text.replace(regex, (_, g1) => g1 || '');
}

let text = '<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>';
console.log(removeTags(text));

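The same two-group idea can be stretched to tolerate attributes and uppercase tag names. The following is only a sketch and not part of the original answer: stripTagsExcept and its allowed parameter are hypothetical names, and it assumes every allowed name is a plain tag name with no regex metacharacters. It inherits the limitations of the regex approach.

```javascript
// Hypothetical helper: keep only the tags named in `allowed`, strip every other tag.
// Assumption: each entry in `allowed` is a plain tag name such as 's' or 'b'.
function stripTagsExcept(text, allowed = ['s']) {
  const names = allowed.join('|');
  // Group 1: an opening or closing tag whose name is exactly an allowed name.
  //   (?=[\s/>]) requires the name to end there, so "s" does not match "span".
  //   [^>]* tolerates attributes such as <s class="x">.
  // Group 2: any other "<...>" run.
  const regex = new RegExp(`(</?(?:${names})(?=[\\s/>])[^>]*>)|(<[^>]*>)`, 'gi');
  // Return the match when group 1 captured it, otherwise drop it.
  return text.replace(regex, (_, keep) => keep || '');
}

console.log(stripTagsExcept('<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>'));
// Span Text <s>S Text Strong Text</s>

console.log(stripTagsExcept('<S class="x">a</S><section>b</section>'));
// <S class="x">a</S>b
```

The lookahead is the part that keeps <span> and <strong> from being mistaken for <s>: the name must be followed by whitespace, a slash, or the closing bracket.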

Note the flaw: if < and > exist as text, everything in between may be considered a tag when it is not:

function removeTags(text) {
  const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
  return text.replace(regex, (_, g1) => g1 || '');
}

let text = '<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>';
console.log("But the regex fails:");
console.log(removeTags(text));

An HTML parser, by contrast, can see that those brackets do not create a tag:

<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>

If you want accurate parsing, use a real HTML parser instead of a regex.
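
Short of a full parser, one cheap mitigation for this particular flaw is to let group 2 match only when the opening bracket is followed by a letter (optionally after a slash), which is roughly how HTML decides that a tag has started. This is a sketch under that assumption, not a complete fix: removeTagsLetterStart is a hypothetical name, and the pattern still breaks on a > inside an attribute value and no longer strips comments or doctypes.

```javascript
// Variant of removeTags: group 2 requires "<" or "</" to be followed by a letter,
// so a lone "<" or ">" in the text is left untouched.
function removeTagsLetterStart(text) {
  const regex = /(<\/?s>)|(<\/?[a-z][^>]*>)/gi; // Group 1 OR stricter Group 2
  return text.replace(regex, (_, g1) => g1 || '');
}

const sample = '<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>';
console.log(removeTagsLetterStart(sample));
// " This is how you start a tag: `<` and this is how you end a tag: `>`"
```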

CodePudding user response:

You could try: /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi

The first part, (<([^>s]+)>), matches every tag that does not contain the letter s.

The second part, (<\/?(\w{2,})>), matches every attribute-free tag whose name has two or more characters.

Demo: https://regex101.com/r/AFlXam/1
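
A quick way to check this pattern is to run it on a couple of inputs. The sketch below assumes the quantifiers in the pattern are + (as in the question's original regex), and stripWithPattern is a hypothetical helper name used only for the check. The second call shows a limitation: a tag that contains the letter s and also carries attributes matches neither part, so it survives.

```javascript
// The pattern from this answer, with the + quantifiers written out.
const pattern = /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi;

// Hypothetical helper: remove everything the pattern matches.
function stripWithPattern(html) {
  return html.replace(pattern, '');
}

console.log(stripWithPattern('<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>'));
// Span Text <s>S Text Strong Text</s>

console.log(stripWithPattern('<span class="x">kept by mistake</span>'));
// <span class="x">kept by mistake
```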
