I have HTML containing Cyrillic characters, and I am processing it with BeautifulSoup4. It works well, but when I call prettify, all the Cyrillic characters are converted to something else. Here is a minimal example using Python 3:
from bs4 import BeautifulSoup
hello = '<span>Привет, мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("Before prettify:\n{}".format(soup))
soup = soup.prettify(formatter='html')
print("\nafter prettify:\n{}".format(soup))
Here is the output it generates:
Before prettify:
<span>Привет, мир</span>
after prettify:
<span>
 &Pcy;&rcy;&icy;&vcy;&iecy;&tcy;, &mcy;&icy;&rcy;
</span>
It's formatting the HTML properly (putting the tags on their own lines), but it's converting the Cyrillic characters to something else (I'm not even certain what encoding that is, to be honest).
I have tried various things to prevent this: prettify(encoding=None, formatter='html') and prettify(encoding='utf-8', formatter='html'). I have also tried changing the way I create the soup object: soup = BeautifulSoup(hello.encode('utf-8'), 'html.parser') and soup = BeautifulSoup(hello, 'html.parser', from_encoding='utf-8'). Nothing seems to change what happens to the Cyrillic characters during prettify.
I figure this must be a very simple mistake I am making with encoding parameters somewhere, but after searching the internet and BS4 documentation, I am unable to figure this out. Is there a way to use BeautifulSoup's prettify, but maintain the Cyrillic characters as they were originally, or is this not possible?
EDIT: I have now realized (thanks to DYZ's answer) that removing formatter='html' from the call to prettify stops BeautifulSoup from converting the Cyrillic characters. Unfortunately, this also removes any &nbsp; entities in the document. After looking at BS4's output-formatters documentation, it seems the solution is likely to create a custom formatter using BS4's Formatter class and pass it in the call to prettify: soup.prettify(formatter=my_formatter). I'm not sure yet what that would entail, though.
CodePudding user response:
From the documentation:
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
If this is not desirable, do not use the HTML formatter:
soup.prettify()
#'<span>\n Привет, мир\n</span>'
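Regarding the follow-up about &nbsp;: one possible approach is a custom formatter that leaves Cyrillic text alone but writes non-breaking spaces back out as &nbsp;. The following is only a sketch, assuming a bs4 version that ships bs4.formatter.HTMLFormatter (4.8 or later). The names keep_unicode_but_escape_nbsp and nbsp_formatter are made up for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import HTMLFormatter


def keep_unicode_but_escape_nbsp(text):
    """Hypothetical helper: escape only &, <, > and non-breaking spaces."""
    # Do the minimal escaping first, so the "&" in "&nbsp;" added below
    # is not escaped a second time.
    escaped = EntitySubstitution.substitute_xml(text)
    # The parser has already turned "&nbsp;" into U+00A0; turn it back.
    return escaped.replace("\u00a0", "&nbsp;")


# Illustrative name; any HTMLFormatter instance can be passed to prettify().
nbsp_formatter = HTMLFormatter(entity_substitution=keep_unicode_but_escape_nbsp)

hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
pretty = soup.prettify(formatter=nbsp_formatter)
print(pretty)
```

With this formatter the Cyrillic text should come through unchanged, while the non-breaking space is emitted as &nbsp; again. Other characters that formatter='html' would have converted to named entities are left as plain Unicode, so extend the replacement step if more entities need to be preserved.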