I am trying to open a URL like the following:
import urllib.request

url = "https://www.chess.cornell.edu/index.php/users/calculators/calculator-absolute-flux-measurement-using-xpd100"

# I tried to access this URL,
req = urllib.request.Request(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
# using the user agent that many answers suggested.
f = urllib.request.urlopen(req)
However, I always get an error like the following:
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Not Found
Thanks a lot for any help!
CodePudding user response:
I used the requests library and it worked fine:
import requests

r = requests.get("https://www.chess.cornell.edu/index.php/users/calculators/calculator-absolute-flux-measurement-using-xpd100")
Even though it returns a <Response [404]>, you can still use r.text to get the HTML of the site.
This probably happens because the site returns a 404 (Not Found) status code even though it actually serves a valid page. While urllib treats that status as an error and raises, your browser and requests will still follow through and show you the page.
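For example, here is a minimal sketch of that behaviour: requests hands you the body regardless of the status code, and only raises if you explicitly ask it to.

import requests

url = "https://www.chess.cornell.edu/index.php/users/calculators/calculator-absolute-flux-measurement-using-xpd100"
r = requests.get(url)

print(r.status_code)    # 404, despite the page rendering fine in a browser
html = r.text           # the response body is still available
# r.raise_for_status()  # uncommenting this would raise, much like urllib does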
Glad if this helps :)
CodePudding user response:
If you need to fetch the response body even for a 404 error, this is how it's done using urllib:
import urllib.request
import urllib.error  # HTTPError lives here

try:
    f = urllib.request.urlopen(req)
except urllib.error.HTTPError as err:
    # HTTPError objects are also file-like, so the error body stays readable
    f = err
This is a very simplistic snippet, of course, assuming you want to do f.read()
later on to process the content. In a robust program there should be checks for the HTTP response code, the content type, and so on.
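As a rough illustration of what such checks might look like (a sketch only; the fetch helper and its particular checks are illustrative, not part of the original answer):

import urllib.request
import urllib.error

def fetch(url):
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        f = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        f = err  # keep the file-like error object so its body can be read
    status = f.getcode()  # works for both the response and the error object
    content_type = f.headers.get('Content-Type', '')
    body = f.read()
    # only trust the payload if it is HTML, whatever the status code was
    if 'text/html' not in content_type:
        raise ValueError(f'unexpected content type: {content_type!r}')
    return status, body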
There is nothing wrong with using requests
(as suggested by @DeeraWijesundara), of course. In fact, I would personally use requests
too in a similar case, but for completeness' sake I've decided to add a stdlib-only answer.