I have the string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'.
I need to remove the HTML tags and keep only the text:
import re
p = re.compile(r'\s*<[^>]+>\s*')
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)
OUTPUT: TEST1TEST2TEST3
But this removes every HTML tag. How should I change the regex so that the <a> element is kept and the output looks like this?
OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>
CodePudding user response:
You can use negative lookaheads.
In your case, you can exclude <a and </a> from the match:
(?!<a )(?!<\/a>)<[^>]+>
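Note the space in "<a " and the closing angle bracket in "</a>". They ensure that only the opening and closing tags of an <a> element are excluded from the match, and not other tags whose names merely begin with an a (for example <abbr>). Also drop the surrounding \s* from your original pattern, otherwise the spaces between the words are removed as well.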