Removing HTML tags but keeping the word stuck to the closing bracket

Time:12-25

I am working with text data and want to remove any HTML markup, that is, anything between "<" and ">". For example:

< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing

So I used the following code:

def remove_html(s):
    
    s = re.sub('[^\S]*<[^\S]*', "", s)
    s = re.sub('[^\S]*>[^\S]*', "", s)
    return s

Running this code gives the following result:

Solutions Australia LSA is a national labour hire and sourcing

I don't want to remove the word "Labour", but it gets removed because it is stuck to the '>'. Is there any way I can keep it? Please suggest.

CodePudding user response:

import re
def remove_html(data):
    # Match a whole tag at once: "<", one or more non-">" characters, then ">"
    return re.sub('<[^>]+>', '', data).strip()

test_case = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'
print(remove_html(test_case))

Output:

Labour Solutions Australia (LSA) is a national labour hire and sourcing
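
The reason this keeps "Labour" is that the pattern consumes everything from "<" up to the matching ">" and nothing after it, so text touching the bracket is left alone. As a rough sketch of the same idea (the names TAG_PATTERN and strip_tags are made up for illustration, and replacing each tag with a space is my own assumption, not part of the answer above), the leftover whitespace can be tidied up too:

```python
import re

# One whole tag: "<", then one or more characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove tag-like spans and collapse the whitespace they leave behind."""
    # Assumption: swap each tag for a space so words separated only by a tag
    # do not get glued together.
    without_tags = TAG_PATTERN.sub(" ", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(without_tags.split())

sample = '<p style="text-align:justify">Labour Solutions <b>Australia</b> (LSA)</p>'
print(strip_tags(sample))
```

Note that a regex like this only suits simple, tag-like text. It will cut a tag short if an attribute value itself contains ">", so for real HTML documents a proper parser is probably the safer choice.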
