I am working with text data and want to remove any HTML code, that is, anything between "<" and ">". For example:
< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing
So I used the following code:
import re

def remove_html(s):
    s = re.sub('[^\S]*<[^\S]*', "", s)
    s = re.sub('[^\S]*>[^\S]*', "", s)
    return s
Running the code gives the following result:
Solutions Australia LSA is a national labour hire and sourcing
I don't want to remove the word "Labour", but it gets removed because it is stuck to the '>'. Is there any way I can keep it? Please suggest.
CodePudding user response:
import re
def remove_html(data):
    # Remove each tag: '<', one or more characters that are not '>', then '>'.
    return re.sub('<[^>]+>', '', data).strip()
test_case = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'
print(remove_html(test_case))
Output:
Labour Solutions Australia (LSA) is a national labour hire and sourcing
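To see why matching the whole tag keeps the word, here is a minimal sketch comparing the two ideas. It assumes the question's pattern was meant to consume the non-whitespace characters touching each bracket (written here as `\S*`, which may not be exactly what the asker ran). The function names `strip_around_brackets` and `strip_whole_tags` and the shortened `sample` string are hypothetical, for illustration only.

```python
import re

def strip_around_brackets(text):
    # Assumed intent of the question's code: delete each bracket together
    # with any non-whitespace characters touching it on either side.
    text = re.sub(r'\S*<\S*', '', text)
    text = re.sub(r'\S*>\S*', '', text)
    return text

def strip_whole_tags(text):
    # Delete everything from '<' up to the next '>', and nothing else.
    return re.sub(r'<[^>]+>', '', text).strip()

sample = '< p style="x" >Labour Solutions'
print(strip_around_brackets(sample))  # "Labour" is swallowed along with '>'
print(strip_whole_tags(sample))       # "Labour" survives
```

The second pattern stops at the first '>' it meets, so text that follows a tag without a space is left alone. It is probably fine for simple input like this, but it would likely misbehave if a '>' appears inside an attribute value.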