Python -- Cannot properly decode a title tag from a Japanese website

Time:08-29

When I issue a requests.get request for the website design-dogs[.]com, the returned HTML is not decoded properly.

import sys

import requests

file_buffer = ""
response_size = 0
with requests.get("https://design-dogs[.]com", stream = True) as response:
  for chunk in response.iter_content(chunk_size = 1000000, decode_unicode = True):
    response_size += len(chunk)
    # Abort if the response grows beyond the size limit
    if response_size > 2048000:
      file_buffer = ""
      response.close()
      print(file_buffer)
      sys.exit(1)

    file_buffer += chunk

  response.close()

print(file_buffer)

Output, title excerpt only:

æ ªå¼ä¼šç¤¾ デザインドッグス

When it should be:

株式会社 デザインドッグス

Why is this happening? This doesn't occur on any other website.

Answer:

The server does not declare a charset in its response headers:

import requests

response = requests.get("https://design-dogs.com")
print(response.headers)

Prints:

{
    "Server": "nginx",
    "Date": "Sun, 28 Aug 2022 17:04:12 GMT",
    "Content-Type": "text/html",                         # <--- no charset=utf-8 parameter
    "Last-Modified": "Wed, 20 Jul 2022 04:13:38 GMT",
    "Transfer-Encoding": "chunked",
    "Connection": "keep-alive",
    "ETag": 'W/"62d780f2-39d9"',
    "X-Powered-By": "PleskLin",
    "Content-Encoding": "br",
}

Without a charset parameter on a text/* content type, requests falls back to ISO-8859-1, which is the wrong encoding here:

print(response.encoding)

Prints:

ISO-8859-1
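
The garbling can be reproduced without any network access. The following is a minimal sketch that uses only the title string from the question: UTF-8 bytes decoded as ISO-8859-1 turn each multi-byte character into several Latin-1 characters. Because ISO-8859-1 maps every byte to exactly one character, the damage is reversible.

```python
# The title as it should appear, taken from the question.
title = "株式会社 デザインドッグス"

# What the server actually sends: UTF-8 bytes.
raw = title.encode("utf-8")

# What happens when those bytes are read as ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps each byte to exactly one character, so re-encoding
# recovers the original bytes, which can then be decoded as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```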

The web browser displays the page correctly because there is a <meta charset="utf-8"> tag at the beginning of the page.
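
A script can look for the same declaration. The sketch below is a rough approximation, not what requests or any browser actually runs: sniff_meta_charset is a hypothetical helper, and the regular expression and the 1024-byte scan limit are assumptions chosen for illustration.

```python
import re

# Hypothetical helper, not part of requests: find a charset declaration
# in a <meta> tag near the start of the raw HTML bytes.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(html: bytes, scan_limit: int = 1024):
    """Return the declared charset in lower case, or None if none is found."""
    match = _META_CHARSET.search(html[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(sniff_meta_charset(b'<html><head><meta charset="utf-8"><title>x</title>'))
```

The pattern also matches the older form <meta http-equiv="Content-Type" content="text/html; charset=...">. For anything beyond a quick check, a real HTML parser is the safer choice.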

So to display the HTML correctly you can do:

response.encoding = "utf-8"
print(response.text)

# OR:

print(response.content.decode("utf-8"))
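
Hard-coding UTF-8 works for this site, but it would override a charset that another server declares correctly. One possible refinement, sketched here with the standard library only, is to use the charset from the Content-Type header when there is one and fall back to UTF-8 otherwise. The names charset_from_content_type and decode_body are hypothetical, and the UTF-8 fallback is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str, fallback: str = "utf-8") -> str:
    """Decode with the declared charset, or with `fallback` if none is declared."""
    encoding = charset_from_content_type(content_type) or fallback
    try:
        return body.decode(encoding, errors="replace")
    except LookupError:
        # The server declared a charset name that Python does not know.
        return body.decode(fallback, errors="replace")


print(charset_from_content_type("text/html"))
print(decode_body("株式会社".encode("utf-8"), "text/html"))
```

With requests, this could be called as decode_body(response.content, response.headers.get("Content-Type", "")).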

With your code snippet:

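import requests

file_buffer = b""
response_size = 0

with requests.get("https://design-dogs.com", stream=True) as response:
    # Collect raw bytes; without decode_unicode, requests does not guess the encoding
    for chunk in response.iter_content(chunk_size=1_000_000):
        response_size += len(chunk)

        # Discard everything if the response exceeds the size limit
        if response_size > 2_048_000:
            file_buffer = b""
            break

        file_buffer += chunk

# Decode once, explicitly as UTF-8
print(file_buffer.decode("utf-8"))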
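
Collecting bytes and decoding once at the end also avoids a subtler problem: a multi-byte UTF-8 character can straddle a chunk boundary, so decoding each chunk on its own may corrupt it. If the text is needed chunk by chunk, an incremental decoder handles this. The following is a minimal sketch; decode_stream and max_bytes are hypothetical names, and the size limit only mirrors the one in the question.

```python
import codecs


def decode_stream(chunks, encoding="utf-8", max_bytes=2_048_000):
    """Decode an iterable of byte chunks; return None if max_bytes is exceeded."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = []
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received > max_bytes:
            return None
        # The decoder buffers an incomplete trailing byte sequence until
        # the next chunk supplies the rest of the character.
        parts.append(decoder.decode(chunk))
    # final=True flushes anything still buffered.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "株式会社".encode("utf-8")
# Split in the middle of the first three-byte character.
print(decode_stream([data[:2], data[2:]]))
```

With requests, the chunks argument could be response.iter_content(chunk_size=1_000_000) from a streamed response.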