How to remove html elements from a string but exclude a specific element with regex

Time:04-28

I have a string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'

I need to remove the HTML tags and keep the text:

import re
# Match any tag, together with the whitespace around it
p = re.compile(r'\s*<[^>]+>\s*')
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)

OUTPUT: TEST1TEST2TEST3

But this removes every HTML element. How should I change the regex so that the output looks like this:

OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>

CodePudding user response:

You can use negative lookaheads.

In your case, you can exclude tags that start with <a  (with a trailing space) as well as the closing tag </a>:

(?!<a )(?!<\/a>)<[^>]+>

Note the space after <a and the closing angle bracket in </a>. They ensure that only the opening and closing tags of an <a> element are excluded, and not other tags whose names merely begin with an a (such as <abbr>).
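Applied to the string from the question, the lookahead pattern can be sketched as below. The helper name strip_tags_except_a is hypothetical and only for illustration. The \s* parts of the original pattern are dropped here as an assumption, because otherwise the spaces between the words would be removed as well. This is a minimal sketch for simple strings like the one in the question, not a general HTML parser.

```python
import re

# Match any tag unless it starts with "<a " or is exactly "</a>".
# No \s* around the tag, so the whitespace between words is preserved.
TAG_EXCEPT_A = re.compile(r'(?!<a )(?!</a>)<[^>]+>')

def strip_tags_except_a(html):
    """Hypothetical helper: remove every tag except <a ...> and </a>."""
    return TAG_EXCEPT_A.sub('', html)

result = strip_tags_except_a('<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(result)  # TEST1 TEST2 <a href="#">TEST3</a>
```

Because the first lookahead requires a space after <a, a bare <a> tag without attributes would still be removed by this pattern.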
