I have the following code here:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())
But it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte
for the line await response.text(). I believe the problem is that the URL ends in .htm instead of .com.
Is there any way to decode it?
Note: I would not like to use response.read().
CodePudding user response:
The website's headers indicate that the page should be encoded as UTF-8, but evidently it isn't:
$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm | grep -i charset
content-type: text/html; charset=UTF-8
Let's inspect the content:
>>> import requests
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'
It looks like this should be "Fußball", which would be b'Fu\xc3\x9fball' if encoded with UTF-8.
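To make the mismatch concrete, here is a small sketch that works only on the ten bytes shown above (no network access needed). The variable names are just for illustration.

```python
# The ten bytes returned by r.content[2170980:2170990] above.
snippet = b'/">Fu\xdfball'

# UTF-8 rejects the lone 0xdf byte: it announces a two-byte sequence,
# but the next byte (b'b') is not a continuation byte.
try:
    snippet.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(utf8_error)

# Latin-1 maps every byte to a character, so this decode cannot fail.
print(snippet.decode("latin-1"))

# What the bytes would have looked like if the page really were UTF-8.
print("Fußball".encode("utf-8"))
```

The offset of the failure inside the snippet is 5, which matches position 2170985 from the traceback (2170980 + 5).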
If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings, we find that it represents "ß" in any of these encodings:
cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
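The same lookup can be approximated with Python's own codecs. This is only a sketch: the helper name codecs_mapping_byte and the short candidate list are my own choices for illustration, not the full table.

```python
# A few codec names to probe; the last three are included as
# counter-examples where 0xdf is some other character.
candidates = ["latin_1", "cp1252", "cp1250", "iso8859_15",
              "iso8859_5", "cp437", "koi8_r"]


def codecs_mapping_byte(byte_value, target, names):
    """Return the codec names that decode the single byte to `target`."""
    matches = []
    for name in names:
        try:
            if bytes([byte_value]).decode(name) == target:
                matches.append(name)
        except UnicodeDecodeError:
            # The byte is undefined in this codec; skip it.
            pass
    return matches


matches = codecs_mapping_byte(0xDF, "ß", candidates)
print(matches)
```

Because so many legacy encodings agree on this one byte, the check narrows the choice but cannot single out one encoding on its own.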
Without any other information, I would choose latin-1 as the encoding; however, it might be simpler to pass r.content (the raw bytes)
to Beautiful Soup and let it handle the decoding.
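Since the question asks to avoid response.read(), one option is to tell aiohttp which encoding to use: response.text() accepts encoding and errors arguments. The sketch below assumes latin-1 is an acceptable guess for this page; the function name fetch_html is hypothetical.

```python
import asyncio

import aiohttp

URL = "https://www.pro-football-reference.com/years/2021/defense.htm"


async def fetch_html(url: str, encoding: str = "latin-1") -> str:
    """Fetch a page and decode it with an explicit encoding.

    Passing `encoding` overrides the charset from the Content-Type
    header, so the stray 0xdf byte no longer raises.
    """
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text(encoding=encoding)


# To run against the live site:
# html = asyncio.run(fetch_html(URL))
```

The returned string can then be passed to BeautifulSoup(html, features="lxml") as in the question. If most of the page really is UTF-8 and only a few bytes are wrong, decoding everything as latin-1 would garble the genuine multi-byte characters; in that case await response.text(errors="replace") keeps UTF-8 and substitutes a replacement character for the bad bytes.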