I have fetched data from a website using the BeautifulSoup module. I know from the meta header that the source encoding of this document is 'iso-8859-1'. I also know that BeautifulSoup automatically transcodes to 'UTF-8' upon creation of the BeautifulSoup object.
import requests
from bs4 import BeautifulSoup
url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r=requests.get(url)
soup_data=BeautifulSoup(r.content, 'lxml')
print(soup_data.prettify())
Unfortunately, the website has a duplicate element.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Upon inspection of the BeautifulSoup object using prettify(), I realized that BeautifulSoup converted only one of these meta tags.
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
I'm therefore confused about what the actual encoding of my BeautifulSoup object is.
Also, during data processing I realized that some of the text elements of this object are not displayed properly by my PyCharm console. These strings contain 'iso-8859-1' characters. I therefore suspect that the object is either still in ISO encoding or, even worse, somehow mixed up.
['\xa0\xa0\xa0\xa0M. le président.' '\xa0\xa0\xa0\xa0M. le président.'
I saw these ISO characters for the first time after I ran a numpy function.
series = np.apply_along_axis(lambda x: x[0].get_text(), 0, [df])
Any suggestions on how to proceed from this situation? I would like to convert the object to UTF-8 (and be 100% sure it's fully in UTF-8).
CodePudding user response:
To ensure that you are using the correct encoding, you could use the EncodingDetector packaged with bs4:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r = requests.get(url)
encoding = EncodingDetector.find_declared_encoding(r.content, is_html=True)
soup_data = BeautifulSoup(r.content, "lxml", from_encoding=encoding)
print(soup_data.prettify())
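If you want to confirm which encoding BeautifulSoup actually used to decode the bytes, the parsed object exposes it as original_encoding. A quick check (a sketch; it only prints, it changes nothing above):
# BeautifulSoup records the encoding it used when converting the raw bytes to Unicode.
print(soup_data.original_encoding)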
CodePudding user response:
BeautifulSoup used the ISO-8859-1 encoding to decode r.content (a bytes object) into Unicode (a str object). A str is not encoded at all; it is made of Unicode code points.
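To make the bytes/str distinction concrete, here is a minimal sketch (the sample string is my own, not taken from the page):
text = 'é'                   # a str: a single Unicode code point, no encoding attached
data = text.encode('utf-8')  # encoding produces bytes: b'\xc3\xa9'
print(len(text), len(data))  # 1 2
print(data.decode('utf-8'))  # decoding the bytes gives the str back: 'é'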
It turns out the data wasn't encoded in ISO-8859-1. It was encoded in Windows-1252, a similar encoding that maps most of the 0x80-0x9F range to printable characters instead of control codes.
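You can see the difference directly on a byte like 0x85; a minimal sketch, assuming bytes shaped like the ones this server sends:
raw = b'censure\x85)'                    # Windows-1252 encoded bytes; '…' is the single byte 0x85
print(repr(raw.decode('iso-8859-1')))    # 'censure\x85)' -> U+0085, an unprintable control code
print(repr(raw.decode('windows-1252')))  # 'censure…)'    -> U+2026 HORIZONTAL ELLIPSIS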
The requests response indicates the encoding the website declared (r.encoding) and the apparent encoding determined by its detection code (r.apparent_encoding). Here are some differences in the actual text I found:
import requests
from bs4 import BeautifulSoup
url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r=requests.get(url)
print(f'{r.encoding=}')
print(f'{r.apparent_encoding=}')
print()
soup_data=BeautifulSoup(r.content, 'lxml')
print(repr(soup_data.find('a',href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
print(repr(soup_data.find('a',href="#",accesskey="0").text))
print()
#Using the correct encoding
soup_data=BeautifulSoup(r.content, 'lxml', from_encoding='Windows-1252')
print(repr(soup_data.find('a',href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
print(repr(soup_data.find('a',href="#",accesskey="0").text))
Output. Note the \x85 and \x92 code points in "censure…" and "d’accessibilité" in the first instance. The … (U+2026) and ’ (U+2019) code points don't exist in ISO-8859-1, and the bytes 0x85 and 0x92 translate to U+0085 and U+0092 respectively, which are unprintable control codes. I've used repr() to show them as escape codes.
r.encoding='ISO-8859-1'
r.apparent_encoding='Windows-1252'
'Autres scrutins solennels (déclarations, motions de censure\x85)'
'Politique d\x92accessibilité'
'Autres scrutins solennels (déclarations, motions de censure…)'
'Politique d’accessibilité'
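To come back to the original question of getting everything in UTF-8: once the soup has been built with the right source encoding, every string you pull out of it is a plain Python str (Unicode), so there is nothing left to convert; you only encode when you need bytes again. A minimal sketch (the file name is just an example):
# Text extracted from the soup is already a str, e.g. via .text or .get_text().
text = soup_data.get_text()

# Encode explicitly only when bytes are required (writing a file, sending over a network, ...):
utf8_bytes = soup_data.encode('utf-8')  # whole document serialized as UTF-8 bytes
with open('page.html', 'wb') as f:      # example file name
    f.write(utf8_bytes)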