How to work around encoding problems in redirects

Time:02-13

A website I am trying to scrape seems to have an encoding problem. The pages declare that they are encoded in UTF-8, but when I fetch the HTML source with requests, the redirect address contains bytes that are not valid UTF-8. Browsers seem to be tolerant and fix this automatically, but the Python requests package raises an exception.

My code looks like this:

import requests as rq

res = rq.get(url, allow_redirects=True)

This raises an exception when the redirect string is decoded in the following code (somewhere inside the requests package):

string.decode(encoding)

where string is the redirect string and encoding is 'utf8':

string = b'/aktien/herm\xe8s-aktie'

I found out that the string is in fact encoded in something like 'Windows-1252'. The redirect should actually go to '/aktien/hermès-aktie'.
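To make the mismatch concrete, here is a minimal sketch that uses only the byte string quoted above. It shows that the bytes are rejected as UTF-8 but decode cleanly as Windows-1252 (Python codec name 'cp1252'), and what the path looks like once it is percent-encoded as UTF-8. The variable names are illustrative only.

```python
from urllib.parse import quote

# The raw redirect target from the question.
raw = b'/aktien/herm\xe8s-aktie'

# 0xE8 starts a multi-byte UTF-8 sequence, but the next byte ('s') is not a
# continuation byte, so strict UTF-8 decoding fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Windows-1252 yields the intended path.
decoded = raw.decode('cp1252')

# Percent-encoding the decoded text (UTF-8 by default) gives a pure-ASCII
# path that any decoder can handle.
quoted = quote(decoded)

print(utf8_ok)
print(decoded)
print(quoted)
```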

Now my question: how can I either get requests to be more tolerant of such encoding bugs (like the browsers are), or alternatively pass an encoding explicitly?

I searched for encoding settings, but from what I have seen so far, requests always determines the encoding automatically from the response.

Btw., the page that the redirect leads to starts with the following (it really does claim to be utf-8):

<!DOCTYPE html><html lang="de" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8">

CodePudding user response:

You can use the hooks= parameter of requests.get() and explicitly percent-encode (URL-encode) the Location HTTP header before requests follows the redirect. For example:

import requests
import urllib.parse

url = "<YOUR URL FROM EXAMPLE>"


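def response_hook(hook_data, **kwargs):
    # Called for every response, including intermediate redirect responses.
    if "Location" in hook_data.headers:
        # Percent-encode non-ASCII characters so the redirect target is pure ASCII.
        # Note: quote() keeps "/" but escapes ":", so this assumes a relative Location.
        hook_data.headers["Location"] = urllib.parse.quote(
            hook_data.headers["Location"]
        )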


res = requests.get(url, allow_redirects=True, hooks={"response": response_hook})
print(res.url)

Prints:

https://.../hermès-aktie
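The hook above passes the whole header value to quote(), which works for a relative Location such as the one in the question but would also escape the ":" in an absolute URL. As a hedged variant, the sketch below splits the value first and quotes only the path and query. The helper name quote_location and the example.com URL in the usage are assumptions for illustration, not part of requests.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_location(location: str) -> str:
    """Percent-encode non-ASCII characters in a Location value.

    Hypothetical helper: scheme and host are left untouched, and "%" is
    treated as safe so existing escapes are not encoded a second time.
    """
    parts = urlsplit(location)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def response_hook(response, **kwargs):
    # Rewrite the header in place; responses without a Location are untouched.
    location = response.headers.get("Location")
    if location:
        response.headers["Location"] = quote_location(location)


print(quote_location("/aktien/hermès-aktie"))
print(quote_location("https://example.com/aktien/hermès-aktie?x=1"))
```

It would be registered the same way as in the answer, with hooks={"response": response_hook}. Whether a given site sends relative or absolute Location values is something to check against the actual responses.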