Python3 - urllib.request.urlopen and readlines to utf-8?


Consider this example:

import urllib.request # Python3 URL loading

filelist_url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
filelist_fobj = urllib.request.urlopen(filelist_url)
#filelist_fobj_fulltext = filelist_fobj.read().decode('utf-8')
#print(filelist_fobj_fulltext) # ok, works
lines = filelist_fobj.readlines()
print(type(lines[0]))

This code prints the type of the first entry in the list that readlines() returns for the file object of the urlopen()'d URL:

<class 'bytes'>

... and in fact, all of the entries in the returned list are of the same type.

I am aware that I could do .read().decode('utf-8') as in the commented lines, and then split that result on \n. However, I'd like to know: is there any other way to use urlopen with .readlines() and get a list of ("utf-8") strings?

CodePudding user response:

urllib.request.urlopen returns an http.client.HTTPResponse object, which implements the io.BufferedIOBase interface and therefore returns bytes.

The io module provides TextIOWrapper, which can wrap a BufferedIOBase object (or other similar objects) to add an encoding. The wrapped object's readlines method returns str objects decoded according to the encoding you specified when you created the TextIOWrapper, so if you get the encoding right, everything will work. (On Unix-like systems, utf-8 is the default encoding, but apparently that's not the case on Windows. So if you want portability, you need to provide an encoding. I'll get back to that in a minute.)

So the following works fine:

>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
>>> with urlopen(url) as response:
...   lines = TextIOWrapper(response, encoding='utf-8').readlines()
... 
>>> for line in lines[:5]: print(type(line), line.strip())
... 
<class 'str'> The following are the graphical (non-control) characters defined by
<class 'str'> ISO 8859-1 (1987).  Descriptions in words aren't all that helpful,
<class 'str'> but they're the best we can do in text.  A graphics file illustrating
<class 'str'> the character set should be available from the same archive as this
<class 'str'> file.

It's worth noting that both the HTTPResponse object and the TextIOWrapper which wraps it implement the iterator protocol, so you can use a loop like for line in TextIOWrapper(response, ...): instead of saving the entire web page using readlines(). The iterator protocol can be a big win because it lets you start processing the web page before it has all been downloaded.
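
For example, this streaming variant prints the first five lines without ever holding the whole page in memory (the early break is just for the demo):

from urllib.request import urlopen
from io import TextIOWrapper

url = "https://www.w3.org/TR/PNG/iso_8859-1.txt"
with urlopen(url) as response:
    # Iterating over the wrapper yields decoded str lines as they arrive
    for n, line in enumerate(TextIOWrapper(response, encoding='utf-8')):
        print(line.rstrip())
        if n == 4:          # first five lines only
            break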

Since I work on a Linux system, I could have left out the encoding='utf-8' argument to TextIOWrapper, but regardless, the assumption is that I know the file is UTF-8 encoded. That's a pretty safe assumption, but it's not universally valid. According to the W3Techs survey (updated daily, at least as of this writing), 97.6% of websites use UTF-8 encoding, which means that roughly one in 40 does not. (If you restrict the survey to what W3Techs considers the top 1,000 sites, the percentage increases to 98.7%. But that's still not universal.)

Now, the conventional wisdom, which you'll find in a number of SO answers, is that you should dig the encoding out of the HTTP headers, which you can fairly easily do:

>>> # Tempting though this is, DO NOT DO IT. See below.
>>> with urlopen(url) as response:
...   lines = TextIOWrapper(response,
...                         encoding=response.headers.get_content_charset()
...                        ).readlines()
...

Unfortunately, that will only work if the website declares the content encoding in the HTTP headers, and many sites prefer to put the encoding in a meta tag. So when I tried the above with a randomly selected Windows-1252-encoded site (taken from the W3Techs survey), it failed with an encoding error:

>>> with urlopen(win1252_url) as response:
...   lines = TextIOWrapper(response, 
...                         encoding=response.headers.get_content_charset()
...                        ).readlines()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 346: invalid continuation byte

Note that although the page is encoded in Windows-1252, that information was not provided in the HTTP headers, so TextIOWrapper chose the default encoding, which on my system is UTF-8. If I supply the correct encoding, I can read the page without problems, letting me see the encoding declaration in the page itself.
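
(In other words, get_content_charset() returned None, and passing encoding=None tells TextIOWrapper to fall back to the locale default. If you use this pattern anyway, a small hedge is to make the fallback explicit, e.g. encoding=response.headers.get_content_charset() or 'utf-8'; that would still mis-decode this particular page, but at least it behaves the same on every platform.)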

>>> with urlopen(win1252_url) as response:
...   lines = TextIOWrapper(response,
...                         encoding='Windows-1252'
...                        ).readlines()
... 
>>> print(lines[3].strip())
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

Clearly, if the encoding is declared in the content itself, it's not possible to set the encoding before reading the content. So what to do in these cases?

The most general solution, and the simplest to code, appears to be the well-known BeautifulSoup package, which is capable of using a variety of techniques to detect the character encoding. Unfortunately, that requires parsing the entire page, which is a much more time-consuming task than just reading lines.
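
A rough sketch of that approach (it assumes the third-party beautifulsoup4 package and reuses the win1252_url placeholder from above; original_encoding reports whatever encoding BeautifulSoup settled on):

from urllib.request import urlopen
from bs4 import BeautifulSoup   # third-party: pip install beautifulsoup4

with urlopen(win1252_url) as response:
    soup = BeautifulSoup(response.read(), 'html.parser')

print(soup.original_encoding)          # e.g. 'windows-1252'
lines = soup.get_text().splitlines()   # decoded str lines, markup stripped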

Another option would be to read the first kilobyte or so of the web page, as bytes, and then try to find a meta tag. Content providers are supposed to put the meta tag close to the beginning of the web page, and it certainly has to come before the first non-ASCII character. If you don't find a meta tag and there is no character encoding declared in the HTTP headers, then you could try a heuristic encoding detector on the bytes already read, as in the sketch below.
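
Here's a hedged sketch of that idea, again using the win1252_url placeholder. The regex is a crude stand-in for real meta-tag parsing, and the heuristic step assumes the third-party chardet package:

import re
from urllib.request import urlopen

import chardet   # third-party heuristic detector: pip install chardet

with urlopen(win1252_url) as response:
    head = response.read(1024)              # first kilobyte, as bytes
    # Crude search for <meta charset="..."> or charset=... in a meta tag
    m = re.search(rb'charset=["\']?([-\w]+)', head)
    if m:
        encoding = m.group(1).decode('ascii')
    else:
        encoding = (response.headers.get_content_charset()  # HTTP header
                    or chardet.detect(head)['encoding']     # heuristic guess
                    or 'utf-8')                             # last resort
    body = head + response.read()           # reuse the bytes already read
    lines = body.decode(encoding).splitlines()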

The one thing you shouldn't do is rely on the character encoding declared in the HTTP header, regardless of the many suggestions to do so, which you will find here and elsewhere on the web. As we've already seen, the headers often don't contain this information at all, and even when they do, it is often wrong, because for a web designer it's easier to declare the encoding in the page itself than to reconfigure the server to send the correct headers. So you can't really rely on the HTTP header, and you should only use it if you have no other information to go on.
