Why do I get a UnicodeEncodeError in Python only when there is a print statement?

Time:09-03

I started learning Python recently, and as a sort of challenge/project, I decided to try and create a "most common word finder."

To do this, I am using a website called Jisho, specifically its #kanji pages. (This is the page I am using to test my code.) From these pages, the finder will look at the on and kun reading compounds (which are in ul elements with the class no-bullet), and then find and print the most common English word among them.

For code help, this blog post is mainly what I am using. VS Code is my IDE.

I have currently imported urllib.parse, requests, and BeautifulSoup from bs4, and my code currently looks like this:

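import urllib.parse

import requests
from bs4 import BeautifulSoup

kanji = '人'
# Percent-encode the kanji so it is safe to put in a URL.
parsed_kanji = urllib.parse.quote(kanji)

# %20%23kanji is " #kanji" percent-encoded; a literal '#' would start a
# URL fragment, which is never sent to the server.
url = f'https://jisho.org/search/{parsed_kanji}%20%23kanji'

page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Collect the text of each reading compound, with whitespace collapsed.
compounds = []
for li in soup.select('.no-bullet li'):
    comp = ' '.join(li.text.split())
    compounds.append(comp)
print(compounds)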

(The code to find the most common word is not included.)

Everything works fine when print(compounds) is not there, but when it is included, I get the following error message:

Traceback (most recent call last):
    File "c:\Users\Lugnut\OneDrive\Desktop\frequent\most_common\test_list.py", line 22, in <module>
        print(compounds)
    File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u4eba' in position 2: character maps to <undefined>

Why is it that the print() function causes my code to break?

Answer:

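print() has to encode the text with the console's encoding, which was cp1252 here (see the traceback), and cp1252 has no mapping for '人'. Building the list never encodes anything, which is why the error only appears once print(compounds) is added.

Originally, I got rid of the UnicodeEncodeError by calling sys.stdout.reconfigure(encoding='utf-8') in the file.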
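
A minimal sketch of that per-file fix, assuming Python 3.7 or later (where `reconfigure` exists on the standard text streams). The `hasattr` guard and the sample list are my own additions for illustration, not part of the original script:

```python
import sys

# Assumption: Python 3.7+, where sys.stdout normally has reconfigure().
# The guard skips the call if stdout was replaced by another object.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

# Hypothetical sample data standing in for the scraped compounds.
compounds = ["人 ジン"]
print(compounds)
```

This only affects the current process, so it has to be repeated in every script that prints such text.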

Later, I solved the UnicodeEncodeError more permanently by setting the system locale to UTF-8 (following this answer), setting my font to the TrueType font MS Mincho, and switching my console window's code page to 65001 in VS Code with chcp 65001 (following this answer). Now I do not have to call sys.stdout.reconfigure(encoding='utf-8') every time I write this kind of code.
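
To check whether a change like this took effect, something along these lines may help. It is only a sketch: `can_encode` is a hypothetical helper I made up for illustration, not a library function.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The encoding print() will use for this console. It may be None if
# stdout is not a regular text stream.
print(sys.stdout.encoding)

# cp1252 has no mapping for this character, while UTF-8 covers all of Unicode.
print(can_encode("人", "cp1252"))  # False
print(can_encode("人", "utf-8"))   # True
```

If the first line still prints cp1252, the console settings have not been picked up by Python.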
