When I issue a requests.get request for the website design-dogs[.]com, the returned HTML is not decoded properly.
import sys
import requests

file_buffer = ""
response_size = 0
with requests.get("https://design-dogs[.]com", stream = True) as response:
    for chunk in response.iter_content(chunk_size = 1000000, decode_unicode = True):
        response_size += len(chunk)
        if response_size > 2048000:
            file_buffer = ""
            response.close()
            print(file_buffer)
            sys.exit(1)
        file_buffer += chunk
    response.close()
print(file_buffer)
Output, title excerpt only:
æ ªå¼ä¼šç¤¾ デザインドッグス
When it should be:
株式会社 デザインドッグス
Why is this happening? This doesn't occur on any other website.
Answer:
The server does not declare a charset in its Content-Type response header:
import requests
response = requests.get("https://design-dogs.com")
print(response.headers)
Prints:
{
    "Server": "nginx",
    "Date": "Sun, 28 Aug 2022 17:04:12 GMT",
    "Content-Type": "text/html",  # <--- missing UTF-8
    "Last-Modified": "Wed, 20 Jul 2022 04:13:38 GMT",
    "Transfer-Encoding": "chunked",
    "Connection": "keep-alive",
    "ETag": 'W/"62d780f2-39d9"',
    "X-Powered-By": "PleskLin",
    "Content-Encoding": "br",
}
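so requests falls back to ISO-8859-1, its default for text/* content types that do not specify a charset, which is the wrong encoding here: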
print(response.encoding)
Prints:
ISO-8859-1
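Decoding UTF-8 bytes as ISO-8859-1 is what produces the garbled title in the question. A minimal offline sketch, using part of the title from the question as sample data (no network request is made):

```python
# The page's bytes are UTF-8, but here they are also decoded as ISO-8859-1.
title = "株式会社"
raw = title.encode("utf-8")        # the bytes as the server sends them

wrong = raw.decode("iso-8859-1")   # one Latin-1 character per UTF-8 byte
right = raw.decode("utf-8")        # the text the page actually contains

print(repr(wrong))                 # mojibake, three characters per kanji
print(right)
```

Each kanji takes three bytes in UTF-8, so the mis-decoded string is three times as long as the original.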
A web browser displays the page correctly because the HTML contains a <meta charset="utf-8">
tag near the beginning of the page, which requests does not take into account when it sets response.encoding.
So to decode the HTML correctly you can do:
response.encoding = "utf-8"
print(response.text)
# OR:
print(response.content.decode("utf-8"))
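If you would rather not hard-code UTF-8, one option is to imitate the browser and look for a charset declaration in the first bytes of the body. The sketch below is only an approximation: `sniff_meta_charset` is a hypothetical helper (not part of requests), the regex is far simpler than the sniffing algorithm in the HTML specification, and the sample bytes stand in for `response.content`.

```python
import re

# Hypothetical helper, not part of requests: a simplified stand-in for the
# charset sniffing that browsers perform on the start of a document.
_META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def sniff_meta_charset(body: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # The HTML spec expects the declaration within the first 1024 bytes.
    match = _META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else default


# Sample bytes standing in for response.content.
sample = '<html><head><meta charset="utf-8"><title>株式会社</title>'.encode("utf-8")
encoding = sniff_meta_charset(sample)
print(encoding)
print(sample.decode(encoding))
```

With a real response this would presumably be used as `response.encoding = sniff_meta_charset(response.content)` before reading `response.text`. requests also offers `response.apparent_encoding`, which guesses the encoding from the bytes themselves.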
With your code snippet:
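import requests

file_buffer = b""
response_size = 0
with requests.get("https://design-dogs.com", stream=True) as response:
    # Without decode_unicode, iter_content yields raw bytes.
    for chunk in response.iter_content(chunk_size=1_000_000):
        response_size += len(chunk)
        if response_size > 2_048_000:
            # Response is too large: discard what was read and stop.
            file_buffer = b""
            break
        file_buffer += chunk
# Decode once, after the loop, with the correct encoding.
print(file_buffer.decode("utf-8"))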
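The snippet above decodes only after the loop, which sidesteps the case where a multi-byte character is split across two chunks. If you need text chunk by chunk, an incremental decoder from the standard library can buffer the incomplete bytes. This is a sketch under assumptions: `decode_stream` is a hypothetical helper, and the hard-coded `pieces` list stands in for `response.iter_content(...)`.

```python
import codecs
from typing import Iterable, Iterator


def decode_stream(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded text for each byte chunk, buffering incomplete characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush any bytes still buffered
    if tail:
        yield tail


raw = "株式会社".encode("utf-8")
# Split in the middle of the first character (after 2 of its 3 bytes).
pieces = [raw[:2], raw[2:]]
print("".join(decode_stream(pieces)))
```

With requests, the `chunks` argument would be the iterator returned by `response.iter_content(chunk_size=...)`.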