A website I am trying to scrape seems to have an encoding problem. The pages claim to be encoded in UTF-8, but when I fetch the HTML source using requests, the redirect address uses an encoding that is not UTF-8. Browsers are tolerant and fix this automatically, but the Python requests package raises an exception.
My code looks like this (with import requests as rq):
res = rq.get(url, allow_redirects=True)
This raises an exception when requests tries to decode the redirect string, in code hidden somewhere inside the requests package:
string.decode(encoding)
where string is the redirect string and encoding is 'utf8':
string = b'/aktien/herm\xe8s-aktie'
I found out that the redirect target is in fact encoded in something like Windows-1252. The redirect should actually go to '/aktien/hermès-aktie'.
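To illustrate the mismatch: the raw redirect bytes decode cleanly as Windows-1252 but fail as UTF-8. A minimal standalone demonstration (not part of my scraper):

```python
# The raw redirect bytes from the exception above.
raw = b'/aktien/herm\xe8s-aktie'

# Windows-1252 (like Latin-1) maps the byte 0xE8 to 'è', so this succeeds:
print(raw.decode('cp1252'))  # -> /aktien/hermès-aktie

# ...while UTF-8 rejects the lone 0xE8 byte:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)
```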
Now my question: how can I either get requests to be more tolerant of such encoding bugs (like browsers are), or how can I pass an encoding explicitly?
I searched for encoding settings, but from what I have seen so far, requests always determines the encoding automatically from the response.
Btw, the result page of the redirect really does declare UTF-8; it starts with:
<!DOCTYPE html><html lang="de" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8">
CodePudding user response:
You can use the hooks= parameter of requests.get() and explicitly URL-encode the Location HTTP header before requests processes the redirect. For example:
import requests
import urllib.parse

url = "<YOUR URL FROM EXAMPLE>"

def response_hook(hook_data, **kwargs):
    # Percent-encode the Location header before requests follows the
    # redirect, so the target URL is plain ASCII and decodes cleanly.
    if "Location" in hook_data.headers:
        hook_data.headers["Location"] = urllib.parse.quote(
            hook_data.headers["Location"]
        )

res = requests.get(url, allow_redirects=True, hooks={"response": response_hook})
print(res.url)
Prints:
https://.../hermès-aktie
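An alternative, if you prefer not to use hooks, is to disable automatic redirects and follow them yourself, percent-encoding each Location target before requesting it. This is only a sketch under the assumption that the server accepts the UTF-8 percent-encoded form (as it does in the example above); the helper name is mine:

```python
import requests
import urllib.parse

def get_with_tolerant_redirects(url, max_hops=10):
    """Follow redirects manually, percent-encoding non-ASCII targets.

    requests exposes header values decoded as Latin-1, so a
    Windows-1252 Location arrives as e.g. '/aktien/hermès-aktie';
    quoting it yields a plain-ASCII URL that requests can handle.
    """
    res = requests.get(url, allow_redirects=False)
    for _ in range(max_hops):
        if not res.is_redirect:
            break
        # Keep URL delimiters and existing percent-escapes intact.
        target = urllib.parse.quote(res.headers["Location"], safe=":/?&=#%")
        res = requests.get(urllib.parse.urljoin(res.url, target),
                           allow_redirects=False)
    return res
```

The quoting step turns 'hermès' into 'herm%C3%A8s', which is what the browser sends on the wire as well.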