Remove unicode HTML tags in Python

Time:10-30

I have a string from which I would like to remove the HTML tags.

"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......

I would like to just have

"overview":"WTS/VDI macOS.....

I tried BeautifulSoup and the Python Bleach library, but they only recognize tags written with literal '<' and '>' characters. Is there a library or function that removes these for me, or should I convert the Unicode escapes and strip the tags manually?

CodePudding user response:

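Your string presumably comes from somewhere else, such as a JSON response. Encoding it as UTF-8 alone leaves literal \u003c and \u003e sequences untouched; once those escapes are decoded, it becomes: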

'"overview":"<p style="margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;"><span style="font-family: arial, helvetica, sans-serif;"><strong><span style="font-size: 18pt; outline: none !important;">WTS/VDI macOS</span></strong></span></p>\n<hr>\n<p><span...' which tools such as BeautifulSoup should handle.

Note that if your string is the result of subprocess.run with output capture turned on, adding an encoding="utf-8" parameter returns the output as text (str) instead of bytes.
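The \u003c and \u003e sequences in the question look like JSON string escapes, so one way to decode them is to parse the text as JSON. The sketch below assumes the fragment comes from a complete JSON object; the shortened sample and the variable names are made up for illustration.

```python
import json

# Hypothetical sample: a shortened version of the fragment from the question,
# wrapped in braces so that it forms a complete JSON object.
raw = r'{"overview":"\u003cp style=\"margin: 0px;\"\u003e\u003cstrong\u003eWTS/VDI macOS\u003c/strong\u003e\u003c/p\u003e\n\u003chr\u003e"}'

# json.loads decodes the \u003c, \u003e, \" and \n escapes in one pass.
data = json.loads(raw)
html = data["overview"]

print(html)
```

The resulting value contains ordinary '<' and '>' characters and can be handed to an HTML parser such as BeautifulSoup.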

CodePudding user response:

This should do!

import os

from bs4 import BeautifulSoup

# Raw string, so the \u003c sequences stay as literal backslash escapes.
content = r'"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......'

# Decode the escapes, then parse with Python's built-in HTML parser.
soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), 'html.parser')
# Optional: drop presentational attributes from every tag.
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

# Join the non-empty lines of the extracted text.
new_s = os.linesep.join([s for s in soup.text.splitlines() if s])
print(new_s)

This will give me

"overview":"WTS/VDI (Citrix) - macOS WTS (Windows Terminal Services) and VDI (Virtual Desktop Instance) provide you....

without the HTML Tags!
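One caveat with the encode('utf-8').decode('unicode-escape') round trip: it is only safe for ASCII text, because the codec reads every non-escaped byte as Latin-1. The sketch below shows the effect on a made-up sample and a possible alternative that rewrites only the \uXXXX sequences. The helper name decode_u_escapes is hypothetical, and it deliberately ignores surrogate pairs and other escapes such as \" and \n.

```python
import re

# Hypothetical input: literal backslash-u escapes next to a real non-ASCII character.
text = r'\u003cp\u003ecafé\u003c/p\u003e'

# The codec reads the two UTF-8 bytes of "é" as two Latin-1 characters.
mangled = text.encode('utf-8').decode('unicode-escape')

def decode_u_escapes(s):
    # Replace only backslash-u sequences with four hex digits; leave everything else as is.
    return re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)

clean = decode_u_escapes(text)

print(mangled)
print(clean)
```

If the input is valid JSON, parsing it with the json module avoids the problem altogether.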

CodePudding user response:

You can just encode the string as UTF-8 before passing it to BeautifulSoup (it accepts bytes as well as str) and then extract the text:

from bs4 import BeautifulSoup

# Your string excerpt, pasted into the variable xstr:
xstr = '"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......'

print(BeautifulSoup(xstr.encode('utf-8'), 'html.parser').text)

output: "overview":"WTS/VDI macOS
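If installing a third-party package is not an option, the same text extraction can be sketched with the standard library's html.parser module. This is a minimal illustration, not a replacement for BeautifulSoup; the class name TextExtractor, the function strip_tags, and the shortened sample are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)

# Hypothetical shortened sample. In a normal (non-raw) string literal,
# Python itself already turns \u003c and \u003e into "<" and ">".
sample = '\u003cp style="margin: 0px;"\u003e\u003cstrong\u003eWTS/VDI macOS\u003c/strong\u003e\u003c/p\u003e'
print(strip_tags(sample))
```

Because the sample is a regular string literal, no separate decoding step is needed here; text read from a file or a network response would still contain the literal escapes and would need decoding first.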
