Error when passing URLs into a requests.get()

Time:10-01

I've been working on a program that takes URLs from a .csv file and counts the words on each webpage. The URLs come from the rows under the "Article" column of a pandas DataFrame. Each URL is passed to requests.get(url), and the result is assigned to a variable. From what I can tell, the error occurs when the URL is passed to requests.get().

import pandas as pd
import requests

def file_input(file):
    # Takes a .csv file from the user
    df = pd.read_csv(file, sep='[;,]', engine='python')
    for i in range(len(df)):
        df.at[i, "Word Count"] = word_counter(df.at[i, "Article"])

def word_counter(url):
    # Keeps track of the page's word count
    count = 0
    # requests.get(url) takes the URL string and fetches the webpage
    page = requests.get(url)

Here are the error messages:

Traceback (most recent call last):
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/urllib3/response.py", line 406, in _decode
    data = self._decoder.decompress(data)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/urllib3/response.py", line 93, in decompress
    ret  = self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/urllib3/response.py", line 627, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/urllib3/response.py", line 599, in read
    data = self._decode(data, decode_content, flush_decoder)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/urllib3/response.py", line 409, in _decode
    raise DecodeError(
urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 59, in <module>
    main()
  File "main.py", line 44, in main
    file_input(file)
  File "main.py", line 35, in file_input
    df.at[i, "Word Count"] = word_counter(df.at[i, "Article"])
  File "main.py", line 13, in word_counter
    page = requests.get(anything)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/sessions.py", line 745, in send
    r.content
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/models.py", line 899, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
  File "/home/runner/Article-Word-counter/venv/lib/python3.8/site-packages/requests/models.py", line 820, in generate
    raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

Answer:

requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
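The innermost zlib error in that message can be reproduced without any network access. The sketch below assumes the body is decoded with a gzip-mode zlib decompressor (window bits of 16 + zlib.MAX_WBITS); the helper name gzip_decode_error is made up for illustration.

```python
import gzip
import zlib

def gzip_decode_error(data):
    """Return None if data decodes as gzip, else the zlib error message."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        decoder.decompress(data)
    except zlib.error as exc:
        return str(exc)
    return None

# Genuine gzip data decodes cleanly
print(gzip_decode_error(gzip.compress(b"hello")))
# Plain HTML labelled as gzip fails on the very first bytes
print(gzip_decode_error(b"<html>not actually gzip</html>"))
```

The second call fails with "incorrect header check" because the data does not start with the gzip header bytes, which is consistent with a body that is not really gzip despite what the response header says.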

The server's response declares that it is gzip-encoded (Content-Encoding: gzip), but requests failed to decompress the body as gzip. This could be a server misconfiguration, or something more subtle. Try requesting an uncompressed response by setting the Accept-Encoding header (though the server may not honor it):

headers = { 'Accept-Encoding': 'identity' }
page = requests.get(url, headers=headers)
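Since the server may ignore the header, it may also help to catch the exception so that one bad URL does not stop the whole loop over the dataframe. This is only a sketch: the function name fetch_uncompressed, the get parameter (there so the function can be exercised without a network), and the 10-second timeout are assumptions, not part of the original program.

```python
import requests

def fetch_uncompressed(url, get=requests.get, timeout=10):
    """Ask for an uncompressed body; return None if it still cannot be decoded."""
    headers = {"Accept-Encoding": "identity"}
    try:
        return get(url, headers=headers, timeout=timeout)
    except requests.exceptions.ContentDecodingError as exc:
        # The server sent a body that does not match its Content-Encoding header
        print(f"Skipping {url}: {exc}")
        return None
```

In word_counter, the call page = fetch_uncompressed(url) could then be followed by a check for None before counting words.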

You can also check whether you can access the URL using other tools, like curl, or your web browser. Additionally, you can explicitly check the raw response to see what the server is actually sending you. But it seems contacting the webmaster of the URL in question might be the real solution.
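One way to look at the raw response from Python is to stream it and read the first bytes without letting requests decompress them. The sketch below is an assumption about how one might do this: peek_raw and looks_like_gzip are hypothetical helper names, and the byte count and timeout are arbitrary.

```python
import requests

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes

def looks_like_gzip(data):
    """Check whether the bytes begin with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

def peek_raw(url, nbytes=64, timeout=10):
    """Return the declared Content-Encoding and the first undecoded body bytes."""
    # stream=True stops requests from reading (and decoding) the body up front
    with requests.get(url, stream=True, timeout=timeout) as resp:
        declared = resp.headers.get("Content-Encoding")
        # decode_content=False returns the body bytes without decompressing them
        head = resp.raw.read(nbytes, decode_content=False)
    return declared, head
```

If declared comes back as "gzip" but looks_like_gzip(head) is False, the server is most likely mislabeling its response, which would point back to the site's configuration rather than to the calling code.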
