The Script
Here's a simple script that uses requests to download a web page:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.quotes.html"
data = requests.get(url).text
However, it seems to hang at the call to requests.get.
If I use the following instead:
data = requests.get(url, allow_redirects=False, timeout=5).text
it fails after five seconds with the following traceback:
Traceback (most recent call last):
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\site-packages\urllib3\connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\site-packages\urllib3\connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\http\client.py", line 1374, in getresponse
response.begin()
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\http\client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\ssl.py", line 1274, in recv_into
return self.read(nbytes, buffer)
File "C:\Users\dharm\anaconda3\envs\html-table-parse\lib\ssl.py", line 1130, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out
...
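The traceback ends in a read timeout: the TLS connection was established, but the server sent nothing back within the five seconds allowed. As a minimal sketch (the helper name fetch_text and the timeout values are assumptions for illustration, not part of the original script), requests accepts a (connect, read) timeout tuple and normally surfaces these failures as requests.exceptions.ConnectTimeout and requests.exceptions.ReadTimeout, which lets the script tell the two cases apart instead of hanging:

```python
import requests


def fetch_text(url, connect_timeout=3.05, read_timeout=5.0):
    """Return the response body, or None if the server does not answer in time.

    Hypothetical helper for illustration; the timeout values are assumptions.
    """
    try:
        # timeout=(connect, read): seconds allowed to establish the connection,
        # then seconds allowed to wait for the server to send data.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.ConnectTimeout:
        print(f"Could not connect to {url} within {connect_timeout}s")
        return None
    except requests.exceptions.ReadTimeout:
        print(f"Connected to {url}, but no response within {read_timeout}s")
        return None
    return response.text
```

A read timeout on a site that loads fine in a browser usually means the server accepted the connection and then ignored the request, which points at the request itself rather than at the network.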
Conda environment
I'm running this in a conda environment with the following packages:
(html-table-parse) PS C:\Users\dharm\Dropbox\Documents> conda list
# packages in environment at C:\Users\dharm\anaconda3\envs\html-table-parse:
#
# Name Version Build Channel
beautifulsoup4 4.11.1 pyha770c72_0 conda-forge
brotlipy 0.7.0 py310he2412df_1004 conda-forge
bzip2 1.0.8 h8ffe710_4 conda-forge
ca-certificates 2022.6.15 h5b45459_0 conda-forge
certifi 2022.6.15 py310h5588dad_0 conda-forge
cffi 1.15.1 py310hcbf9ad4_0 conda-forge
charset-normalizer 2.1.0 pyhd8ed1ab_0 conda-forge
cryptography 37.0.1 py310h21b164f_0
idna 3.3 pyhd8ed1ab_0 conda-forge
intel-openmp 2022.1.0 h57928b3_3787 conda-forge
libblas 3.9.0 15_win64_mkl conda-forge
libcblas 3.9.0 15_win64_mkl conda-forge
libffi 3.4.2 h8ffe710_5 conda-forge
liblapack 3.9.0 15_win64_mkl conda-forge
libzlib 1.2.12 h8ffe710_1 conda-forge
mkl 2022.1.0 h6a75c08_874 conda-forge
numpy 1.23.0 py310h8a5b91a_0 conda-forge
openssl 3.0.5 h8ffe710_0 conda-forge
pandas 1.4.3 py310hf5e1058_0 conda-forge
pip 22.1.2 pyhd8ed1ab_0 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pyopenssl 22.0.0 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py310h5588dad_5 conda-forge
python 3.10.5 hcf16a7b_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.10 2_cp310 conda-forge
pytz 2022.1 pyhd8ed1ab_0 conda-forge
requests 2.28.1 pyhd8ed1ab_0 conda-forge
setuptools 63.1.0 py310h5588dad_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
soupsieve 2.3.1 pyhd8ed1ab_0 conda-forge
sqlite 3.39.0 h8ffe710_0 conda-forge
tbb 2021.5.0 h2d74725_1 conda-forge
tk 8.6.12 h8ffe710_0 conda-forge
tzdata 2022a h191b570_0 conda-forge
ucrt 10.0.20348.0 h57928b3_0 conda-forge
urllib3 1.26.9 pyhd8ed1ab_0 conda-forge
vc 14.2 hb210afc_6 conda-forge
vs2015_runtime 14.29.30037 h902a5da_6 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
win_inet_pton 1.1.0 py310h5588dad_4 conda-forge
xz 5.2.5 h62dcd97_1 conda-forge
Question
What's a good way to get the script to download the page as intended?
CodePudding user response:
Add browser-like headers to the request. The website seems to block requests that don't carry them, which is why the call hangs until it times out:
import requests
url = 'https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.quotes.html'
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
print(requests.get(url, headers=headers, timeout=10).text)
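If the page needs to be fetched more than once, the same idea can be wrapped in a small function. This is only a sketch: the name download_page, the ten-second timeout, and the shortened Accept value are assumptions, and whether these headers remain sufficient depends on the site's bot detection, which may change. It sets the headers on a requests.Session, keeps a timeout so a blocked request fails quickly, and raises on HTTP error statuses instead of silently returning an error page:

```python
import requests

# Browser-like headers based on the answer above. Which of them the site
# actually requires is an assumption and may change over time.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    ),
}


def download_page(url, timeout=10):
    """Fetch url with browser-like headers and return the HTML as text.

    Hypothetical helper for illustration; raises requests.exceptions.HTTPError
    on a 4xx/5xx status and requests.exceptions.Timeout if the server stalls.
    """
    with requests.Session() as session:
        # Headers set on the session are sent with every request it makes.
        session.headers.update(BROWSER_HEADERS)
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
```

Calling download_page(url) with the URL from the question should then return the page source, provided the site still accepts these headers.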