Is there any way I can preserve HTML entities in the source when parsing it with BeautifulSoup?
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>&quot;Hello World!&quot; I said', 'html.parser')
print(soup.string)
# Outputs: '"Hello World!" I said'
# Wanted/Expected: '&quot;Hello World!&quot; I said'
Also, when writing those preserved HTML entities back to a file, will f.write(str(soup))
do? The following code is meant to produce an identical copy of the original, but currently does not:
from bs4 import BeautifulSoup
from pathlib import Path
# The original contains tons of HTML entities
original = Path("original.html")
output = Path("duplicate.html")
with open(original, "rt", encoding="utf8") as f:
    soup = BeautifulSoup(f, "lxml")
with open(output, "wt", encoding="utf8") as f:
    f.write(str(soup))
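To see where a copy diverges from the original, a small diff helper may be useful. This is only a sketch using the standard library; the function name `first_differences` and the sample strings are made up for illustration.

```python
import difflib


def first_differences(original_text, duplicate_text, limit=10):
    """Return up to `limit` unified-diff lines between two HTML strings."""
    diff = difflib.unified_diff(
        original_text.splitlines(),
        duplicate_text.splitlines(),
        fromfile="original.html",
        tofile="duplicate.html",
        lineterm="",
    )
    # zip() with range() stops the generator after `limit` lines.
    return [line for _, line in zip(range(limit), diff)]


# Hypothetical sample: the copy has lost its &quot; entities.
for line in first_differences('<p>&quot;Hi&quot;</p>', '<p>"Hi"</p>'):
    print(line)
```

With real files, the two arguments could be the results of `original.read_text(encoding="utf8")` and `output.read_text(encoding="utf8")`.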
CodePudding user response:
You have to create a custom formatter:
from bs4 import BeautifulSoup
def formatQuot(string):
    # Turn literal double quotes back into the &quot; entity
    return string.replace('"', '&quot;')
soup = BeautifulSoup('<p>&quot;Hello World!&quot; I said ', 'html.parser')
print(soup.decode(formatter=formatQuot))
# <p>&quot;Hello World!&quot; I said </p>
text = formatQuot(soup.text)
print(text)
# &quot;Hello World!&quot; I said
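If more than the double quote needs escaping, the standard library's `html.escape` might serve as the substitution function instead of a hand-written `replace` chain. A minimal sketch follows; the wrapper name `escape_entities` is hypothetical, and note that `html.escape` writes the apostrophe as `&#x27;`, which may not match the spelling used in the original file.

```python
import html


def escape_entities(text):
    # html.escape rewrites &, < and >; quote=True also rewrites " and '
    return html.escape(text, quote=True)


print(escape_entities('"Hello World!" I said'))
# &quot;Hello World!&quot; I said
```

Since it takes a string and returns a string, it should be usable in the same place as `formatQuot`, for example `soup.decode(formatter=escape_entities)`.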
CodePudding user response:
Thanks @uingtea for the custom formatter suggestion. As I also need to preserve the tag attribute order, I've subclassed HTMLFormatter as per the BeautifulSoup docs:
from pathlib import Path
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
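def my_formatter(string):
    # Escape '&' first so the entities added below are not escaped again
    string = string.replace('&', '&amp;')
    # string = string.replace('…', '&hellip;')
    string = string.replace('"', '&quot;').replace("'", '&#39;')
    string = string.replace('<', '&lt;').replace('>', '&gt;')
    return string
class customFormat(HTMLFormatter):
    def attributes(self, tag):
        # Yield attributes in source order instead of the default sorted order
        for k, v in tag.attrs.items():
            yield k, v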
cform = customFormat(my_formatter)
original = Path("original.html")
output = Path("output.html")
with open(original, "rt", encoding="utf8") as f:
    soup = BeautifulSoup(f, "lxml")
with open(output, "wt", encoding="utf8", newline="\n") as f:
    f.write(soup.decode(formatter=cform))
Is there a cleaner way to write the custom formatter, i.e. without defining a free function and then passing it to the constructor of the subclassed formatter? The docs are pretty scant on how to write a custom/subclassed formatter.
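One possible way to keep everything in one place is to define the substitution as a method of the subclass and hand it to the parent constructor through the `entity_substitution` keyword. This is a sketch, not an official recipe: the names `EntityPreservingFormatter`, `ENTITY_TABLE` and `escape_text` are hypothetical, the sample markup is made up, and the table is a trimmed version of the replacements used above.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class EntityPreservingFormatter(HTMLFormatter):
    """Hypothetical formatter: re-escapes entities and keeps attribute order."""

    # '&' must come first so that the ampersands introduced by the
    # later replacements are not escaped a second time.
    ENTITY_TABLE = (
        ("&", "&amp;"),
        ('"', "&quot;"),
        ("<", "&lt;"),
        (">", "&gt;"),
    )

    def __init__(self):
        # Pass our own method as the substitution function.
        super().__init__(entity_substitution=self.escape_text)

    @classmethod
    def escape_text(cls, text):
        for char, entity in cls.ENTITY_TABLE:
            text = text.replace(char, entity)
        return text

    def attributes(self, tag):
        # Yield attributes in source order instead of the default sorted order.
        yield from tag.attrs.items()


soup = BeautifulSoup(
    '<p id="x" class="y">&quot;Hello World!&quot; I said</p>', "html.parser"
)
markup = soup.decode(formatter=EntityPreservingFormatter())
print(markup)
```

The calling code then only needs `soup.decode(formatter=EntityPreservingFormatter())`, with no separate free function to pass around.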