The Script

Here's a simple script that uses requests to download a web page:

import requests
import pandas as pd

from bs4 import BeautifulSoup

url = "https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.quotes.html"

data = requests.get(url).text

However, it seems to hang at the call to requests.get.

If I use the following instead:

data = requests.get(url, allow_redirects=False, timeout=5).text

it outputs the following:

Traceback (most recent call last):
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\site-packages\urllib3\connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\site-packages\urllib3\connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\ssl.py", line 1274, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\ssl.py", line 1130, in read
    return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out
...
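
The traceback ends in a read timeout: the TLS connection was opened and the request sent, but no response arrived within the 5 seconds allowed. A minimal sketch of telling that case apart from a connection failure follows. The helper name `fetch_or_report` and the timeout values are illustrative assumptions, not part of the original script.

```python
import requests


def fetch_or_report(url, connect_timeout=3.05, read_timeout=5):
    """Return (text, None) on success or (None, reason) on a timeout.

    fetch_or_report is a hypothetical helper; the timeout values are
    illustrative assumptions, not tuned for any particular site.
    """
    try:
        # A (connect, read) tuple sets the two timeouts separately.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.ConnectTimeout:
        # The server could not be reached in time.
        return None, "connect timeout"
    except requests.exceptions.ReadTimeout:
        # The case in the traceback: connected, but no response in time.
        return None, "read timeout"
    return response.text, None
```

A "read timeout" result here suggests the server accepted the connection but chose not to answer, which points toward the request being filtered rather than a network fault.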

Conda environment

I'm running this in a conda environment with the following packages:

(html-table-parse) PS C:\Users\dharm\Dropbox\Documents> conda list
# packages in environment at C:\Users\dharm\anaconda3\envs\html-table-parse:
#
# Name                    Version                   Build  Channel
beautifulsoup4            4.11.1             pyha770c72_0    conda-forge
brotlipy                  0.7.0           py310he2412df_1004    conda-forge
bzip2                     1.0.8                h8ffe710_4    conda-forge
ca-certificates           2022.6.15            h5b45459_0    conda-forge
certifi                   2022.6.15       py310h5588dad_0    conda-forge
cffi                      1.15.1          py310hcbf9ad4_0    conda-forge
charset-normalizer        2.1.0              pyhd8ed1ab_0    conda-forge
cryptography              37.0.1          py310h21b164f_0
idna                      3.3                pyhd8ed1ab_0    conda-forge
intel-openmp              2022.1.0          h57928b3_3787    conda-forge
libblas                   3.9.0              15_win64_mkl    conda-forge
libcblas                  3.9.0              15_win64_mkl    conda-forge
libffi                    3.4.2                h8ffe710_5    conda-forge
liblapack                 3.9.0              15_win64_mkl    conda-forge
libzlib                   1.2.12               h8ffe710_1    conda-forge
mkl                       2022.1.0           h6a75c08_874    conda-forge
numpy                     1.23.0          py310h8a5b91a_0    conda-forge
openssl                   3.0.5                h8ffe710_0    conda-forge
pandas                    1.4.3           py310hf5e1058_0    conda-forge
pip                       22.1.2             pyhd8ed1ab_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pyopenssl                 22.0.0             pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1           py310h5588dad_5    conda-forge
python                    3.10.5          hcf16a7b_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.10                    2_cp310    conda-forge
pytz                      2022.1             pyhd8ed1ab_0    conda-forge
requests                  2.28.1             pyhd8ed1ab_0    conda-forge
setuptools                63.1.0          py310h5588dad_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
soupsieve                 2.3.1              pyhd8ed1ab_0    conda-forge
sqlite                    3.39.0               h8ffe710_0    conda-forge
tbb                       2021.5.0             h2d74725_1    conda-forge
tk                        8.6.12               h8ffe710_0    conda-forge
tzdata                    2022a                h191b570_0    conda-forge
ucrt                      10.0.20348.0         h57928b3_0    conda-forge
urllib3                   1.26.9             pyhd8ed1ab_0    conda-forge
vc                        14.2                 hb210afc_6    conda-forge
vs2015_runtime            14.29.30037          h902a5da_6    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
win_inet_pton             1.1.0           py310h5588dad_4    conda-forge
xz                        5.2.5                h62dcd97_1    conda-forge

Question

What's a good way to get the script to download the page as intended?

Answer

Add the relevant headers to the request. The website seems to block requests that don't send browser-like headers:

import requests

url = 'https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.quotes.html'
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
           'Accept-Encoding': 'gzip, deflate',
           'Accept-Language': 'en-US,en;q=0.9',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

print(requests.get(url, headers=headers).text)
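
Building on the snippet above, here is a hedged sketch that combines the headers with a timeout and a status check, so a blocked request fails fast instead of hanging. The helper name `fetch_page`, the 10-second timeout, and the trimmed `Accept` value are assumptions for illustration, and the site may stop accepting these headers at any time.

```python
import requests

# Browser-like headers based on the answer above (Accept value trimmed).
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    ),
}


def fetch_page(url, headers=BROWSER_HEADERS, timeout=10):
    """Hypothetical helper: GET url with browser-like headers, or raise."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.exceptions.HTTPError on a 4xx/5xx status
    # instead of silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_page(url)` with the URL from the question would then either return the HTML or raise a `requests` exception that says what went wrong.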