I started learning Python recently, and as a sort of challenge/project, I decided to try and create a "most common word finder."
To do this, I am using a website called Jisho, specifically its #kanji pages. (This is the page I am using to test my code.) From these pages, the finder looks at the on and kun reading compounds (which are in ul elements with the class no-bullet) and then finds and prints the most common English word among them.
For code help, this blog post is mainly what I am using. VS Code is my IDE.
I have currently imported urllib.parse, requests, and BeautifulSoup from bs4, and my code currently looks like this:
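import urllib.parse
import requests
from bs4 import BeautifulSoup

kanji = '人'
# Percent-encode the whole search term so the space and '#' reach the server
# instead of '#kanji' being dropped as a URL fragment.
query = urllib.parse.quote(f'{kanji} #kanji')
url = f'https://jisho.org/search/{query}'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
# Collect the text of every reading compound, collapsing runs of whitespace.
compounds = []
for li in soup.select('.no-bullet li'):
    comp = ' '.join(li.text.split())
    compounds.append(comp)
print(compounds)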
(The code to find the most common word is not included.)
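A minimal sketch of what that missing step could look like, assuming each entry in compounds is a string that mixes Japanese text with English glosses. The function name most_common_word and the sample entries are made up for illustration; this is not the code from the original project.

```python
import re
from collections import Counter

# Made-up entries standing in for the scraped `compounds` list.
sample_compounds = [
    "人 【ひと】 person, human being",
    "人々 【ひとびと】 people, each person",
    "大人 【おとな】 adult, grown-up person",
]


def most_common_word(entries):
    """Return (word, count) for the most frequent English word, or None."""
    words = []
    for entry in entries:
        # Keep only runs of ASCII letters, so kana and kanji are ignored.
        words.extend(w.lower() for w in re.findall(r"[A-Za-z]+", entry))
    if not words:
        return None
    return Counter(words).most_common(1)[0]


if __name__ == "__main__":
    # The result is plain ASCII, so printing it is safe on any console.
    print(most_common_word(sample_compounds))
```

Counter.most_common(1) returns a one-element list of (word, count) pairs, which is why the sketch indexes it with [0].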
Everything works fine when print(compounds) is not there, but when it is included, I get the following error message:
Traceback (most recent call last):
  File "c:\Users\Lugnut\OneDrive\Desktop\frequent\most_common\test_list.py", line 22, in <module>
    print(compounds)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u4eba' in position 2: character maps to <undefined>
Why does the print() function cause my code to break?
CodePudding user response:
The error comes from print() itself: it has to encode the string for the console, and the traceback shows that the encoding in use is cp1252, which has no mapping for '人' (\u4eba).
Originally, I got rid of the UnicodeEncodeError by adding sys.stdout.reconfigure(encoding='utf-8') to the file.
However, I was able to solve it more permanently by setting the system locale to UTF-8 (through this answer) and, through this answer, setting my font to the TrueType font MS Mincho and switching my console window's code page to 65001 in VS Code (via chcp 65001). That way I do not have to call sys.stdout.reconfigure(encoding='utf-8') every time I want to use this kind of code.
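A small sketch of why print() fails under cp1252 and works under UTF-8, assuming Python 3.7 or newer (where sys.stdout.reconfigure exists). The helper names try_print and use_utf8_stdout are made up for illustration. try_print writes to an in-memory stream with a chosen encoding instead of the real console, so it should behave the same on any machine.

```python
import io
import sys


def try_print(text, encoding):
    """Print `text` to an in-memory stream that encodes like a console would."""
    raw = io.BytesIO()
    # newline="\n" keeps the line ending identical on Windows and elsewhere.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return f"failed: {exc.reason}"
    return raw.getvalue()


def use_utf8_stdout():
    """Per-script fix: make the real stdout encode its output as UTF-8."""
    # The hasattr guard is an addition for environments that replace sys.stdout
    # with an object that has no reconfigure() method.
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


if __name__ == "__main__":
    use_utf8_stdout()
    print(try_print("人", "cp1252"))  # cp1252 has no byte for this character
    print(try_print("人", "utf-8"))   # UTF-8 encodes it as three bytes
```

Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python should have a similar effect without editing each script, although whether the characters actually display still depends on the console font and code page.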