Encoding error whilst asynchronously scraping website


I have the following code here:

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())

But it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the line await response.text(). I believe the problem is that the URL ends in .htm instead of .com.

Is there any way to decode it?
Note: I would not like to use response.read().

CodePudding user response:

The website's headers indicate that the page should be encoded as UTF-8, but evidently it isn't:

$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm  | grep -i charset
content-type: text/html; charset=UTF-8

Let's inspect the content:

>>> import requests
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'

It looks like this should be "Fußball", which would be b'Fu\xc3\x9fball' if encoded with UTF-8.
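
A quick REPL check (not part of the original answer) confirms that UTF-8 encodes the "ß" in "Fußball" as the two bytes 0xc3 0x9f:

>>> 'Fußball'.encode('utf-8')
b'Fu\xc3\x9fball'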

If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings, we find that it represents "ß" in any of these encodings:

cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
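
Any of those codecs will decode the byte the same way; for example, with cp1252 and latin_1:

>>> b'\xdf'.decode('cp1252')
'ß'
>>> b'\xdf'.decode('latin_1')
'ß'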

Without any other information, I would choose latin-1 as the encoding; however, it might be simpler to pass r.content to Beautiful Soup and let it handle the decoding.
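
Applied to the original aiohttp code, the first option looks roughly like this sketch: aiohttp's ClientResponse.text() accepts an explicit encoding, so no call to response.read() is needed. Note that latin-1 is the guess made above, not something the server declares.

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            # The server advertises UTF-8, but the body is not valid UTF-8,
            # so override the encoding explicitly.
            text = await response.text(encoding='latin-1')
            soup = BeautifulSoup(text, features="lxml")
            print(soup.title)

asyncio.run(main())

With requests, the second option would simply be BeautifulSoup(r.content, features="lxml"), letting Beautiful Soup detect the encoding from the raw bytes itself.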
