Preserving source HTML entities with BeautifulSoup

Time:09-08

Is there any way to preserve the HTML entities in the source when parsing it with BeautifulSoup?

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>&quot;Hello World!&quot; I said', 'html.parser')
print(soup.string)
# Outputs: '"Hello World!" I said'
# Wanted/Expected: '&quot;Hello World!&quot; I said'

Also, when writing those preserved HTML entities back to a file, will f.write(str(soup)) do? The following code is meant to produce an identical copy of the original, but currently it doesn't:

from bs4 import BeautifulSoup
from pathlib import Path

# The original contains tons of HTML entities
original = Path("original.html")
output = Path("duplicate.html")

with open(original, "rt", encoding="utf8") as f:
    soup = BeautifulSoup(f, "lxml")

with open(output, "wt", encoding="utf8") as f:
    f.write(str(soup))

CodePudding user response:

You have to create a custom formatter:

from bs4 import BeautifulSoup

def formatQuot(string):
    return string.replace('"','&quot;')
    
soup = BeautifulSoup('<p>&quot;Hello World!&quot; I said ', 'html.parser')
print(soup.decode(formatter=formatQuot))
# <p>&quot;Hello World!&quot; I said </p>

text = formatQuot(soup.text)
print(text)
# &quot;Hello World!&quot; I said
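
One caveat with this approach: the parser decodes every entity into a plain character while building the tree, so the original spelling is gone by the time a formatter runs. A formatter can only re-escape characters uniformly; it cannot tell whether a quote was written as a named entity, a numeric entity, or a literal character. A small illustration, using the standard library's html.unescape as a stand-in for the decoding step (an assumption for demonstration, not BeautifulSoup's actual code path):

```python
import html

# Three different source spellings of the same text.
spellings = ['&quot;Hello&quot;', '&#34;Hello&#34;', '"Hello"']

# After decoding, all three collapse to the same string,
# so the information about which spelling was used is lost.
decoded = {html.unescape(s) for s in spellings}
print(len(decoded))  # 1
```

So an identical copy is only realistic if the source spells each escaped character one consistent way.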

CodePudding user response:

Thanks @uingtea for the custom formatter suggestion. As I also need to preserve the tag attribute order, I've subclassed HTMLFormatter as per the BeautifulSoup docs:

from pathlib import Path
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

def my_formatter(string):
    string = string.replace('&', '&amp;')
    # string = string.replace('…', '&hellip;')
    string = string.replace('"', '&quot;').replace("'", '&#39;')
    string = string.replace('<', '&lt;').replace('>', '&gt;')
    return string

class customFormat(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v

cform = customFormat(my_formatter)

original = Path("original.html")
output   = Path("output.html")

with open(original, "rt", encoding="utf8") as f:
    soup = BeautifulSoup(f, "lxml")

with open(output, "wt", encoding="utf8", newline="\n") as f:
    f.write(soup.decode(formatter=cform))

Is there a cleaner way to write the custom formatter, i.e. without defining a free function and then passing it to the constructor of the subclassed formatter? The docs are pretty scant on how to write a custom/subclassed formatter.
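
One possible way to keep everything in a single class is to define the escaping function as a static method and hand it to the base constructor through its entity_substitution keyword. This is only a sketch: the class name EntityFormatter and the method name escape are made up for illustration, and the sample markup is not from the original file.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class EntityFormatter(HTMLFormatter):
    """Hypothetical formatter: re-escapes text and keeps attribute order."""

    def __init__(self):
        # Pass the escaping method as the entity substitution callable.
        super().__init__(entity_substitution=self.escape)

    @staticmethod
    def escape(text):
        # Escape '&' first so the entities added below are not escaped again.
        text = text.replace("&", "&amp;")
        text = text.replace('"', "&quot;").replace("'", "&#39;")
        return text.replace("<", "&lt;").replace(">", "&gt;")

    def attributes(self, tag):
        # Yield attributes in source order instead of the default sorted order.
        yield from tag.attrs.items()


markup = '<p id="x" class="y">&quot;Hello World!&quot; I said</p>'
soup = BeautifulSoup(markup, "html.parser")
result = soup.decode(formatter=EntityFormatter())
print(result)
```

With this sample the quotes should come back as &quot; and id should stay ahead of class. The call site then becomes soup.decode(formatter=EntityFormatter()), with no separate free function.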
