How to get all unique errors when searching through a lot of sites


I am going through a lot of sites using the requests module and I want to see whether each site is broken, exists, and can be accessed. I am using a try/except block and can see what errors I get.

My issue: I have lots of sites to go through and don't know what errors can happen. I may have seen all of them but I don't know that.

Here are some examples of the errors that occurred:

Err: HTTPSConnectionPool(host='the_site', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))
<class 'requests.exceptions.SSLError'>

Err: HTTPSConnectionPool(host='the_site', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: TLSV1_UNRECOGNIZED_NAME] tlsv1 unrecognized name (_ssl.c:1129)')))
<class 'requests.exceptions.SSLError'>

Err: HTTPSConnectionPool(host='the_site', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1129)')))
<class 'requests.exceptions.SSLError'>

Err: HTTPSConnectionPool(host='the_site', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001D58BAB4850>, 'Connection to the_site timed out. (connect timeout=10)'))
<class 'requests.exceptions.ConnectTimeout'>

Err: HTTPSConnectionPool(host='the_site', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001D58BAB48B0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
<class 'requests.exceptions.ConnectionError'>

Err: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
<class 'requests.exceptions.ConnectionError'>

Err: HTTPSConnectionPool(host='the_site', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001D58BB44C40>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
<class 'requests.exceptions.ConnectionError'>

296 nan: is Not reachable 
Err: Invalid URL 'nan': No schema supplied. Perhaps you meant http://nan?

354 : is Not reachable, status_code: 404

As you can see, they are all slightly different (even ignoring the object ID and the host).

I have tried:

try:
    # Get the URL
    get = requests.get(
        url,
        allow_redirects=True,
        timeout=1,
        verify=True,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"},
    )
    # If the request succeeds
    if get.status_code == 200:
        print(f"{count} {url}: is reachable. status_code: {get.status_code}")
    else:
        print(f"{count} {url}: is Not reachable, status_code: {get.status_code}")
# Exception
except requests.exceptions.RequestException as e:
    print(e.errno)
    print(f"{url}: is Not reachable \nErr: {e}")

but e.errno just returns None. I am not sure how it works, but I expected it to return a unique number associated with that specific error; apparently I was wrong.

I have also played around with the other e.something attributes and other things from the requests module, but I can't seem to find a way to get all the unique types of errors I am getting now and will get later.

For clarification: I am not talking about the exception classes like SSLError or ConnectionError.

TL;DR: How can I get a list of all the unique errors I am getting, so I can search online for how to prevent them?
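To make the goal concrete, here is a rough sketch of the kind of de-duplicated list I am hoping to end up with. The urls list is just a placeholder for my real list of sites, and normalize is a made-up helper that strips the parts that vary between otherwise identical errors (host names, object IDs); I don't know if this is the right approach:

import re
import requests

urls = ["https://example.com", "https://nonexistent.invalid"]  # placeholder for my real list of sites

unique_errors = set()  # (exception class name, normalized message) pairs

def normalize(message):
    # Made-up helper: strip the parts that vary between otherwise
    # identical errors (host names and object ids like 0x000001D58BAB4850)
    message = re.sub(r"host='[^']*'", "host='...'", message)
    message = re.sub(r"0x[0-9A-Fa-f]+", "0x...", message)
    return message

for url in urls:
    try:
        requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        unique_errors.add((type(e).__name__, normalize(str(e))))

for name, message in sorted(unique_errors):
    print(name, message)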

CodePudding user response:

If you only want to produce a list of the errors you are receiving without stopping your code, you can simply catch the base class of all ordinary exceptions, Exception.

Your code will then become:

try:
    # Get the URL
    get = requests.get(
        url,
        allow_redirects=True,
        timeout=1,
        verify=True,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"},
    )
    # If the request succeeds
    if get.status_code == 200:
        print(f"{count} {url}: is reachable. status_code: {get.status_code}")
    else:
        print(f"{count} {url}: is Not reachable, status_code: {get.status_code}")
# Exception
except Exception as e:
    print(f"{url}: is Not reachable \nErr: {e}")

Keep in mind that this obviously catches any and all errors that can occur, so make sure you log them properly to aid in debugging if the need ever arises.
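For example, one possible way to do that logging is a sketch like the following, using the standard logging module (the log file name and the URL list are just examples):

import logging
import requests

logging.basicConfig(filename="site_check.log", level=logging.INFO)

for url in ["https://example.com", "https://nonexistent.invalid"]:  # example list
    try:
        get = requests.get(url, allow_redirects=True, timeout=1)
        logging.info("%s -> %s", url, get.status_code)
    except Exception as e:
        # logging.exception records the message and the full traceback,
        # so the exact error can be looked up later
        logging.exception("%s is not reachable (%s)", url, type(e).__name__)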

CodePudding user response:

Replying just to this part:

but e.errno just returns a None value. I am not sure how it works but I expected it to return the unique number associated with that specific error but I was wrong I guess.

No, that is not how it works.

"errno" is an old Unix error convention, see for example https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/errno.h.html listing them with their symbolic name. If you want those are "OS" errors.

Now if you look at Python's exception mechanism, it defines an "OSError" exception, which is exactly the one that uses errno; see https://docs.python.org/3.8/library/exceptions.html:

exception OSError(errno, strerror[, filename[, winerror[, filename2]]])

So, if you like, Python exceptions are a superset of the OS-level errors that carry an errno value. All other exceptions defined by libraries and by your own code do not have to rely on this convention, and there is no reason they would have a meaningful errno attribute.

(And this is a good thing: how could every library and every codebase settle on a single shared sequence of numbers to encode their own exceptions? It wouldn't scale at all.)
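A small illustration of the difference (the file path and URL below are just deliberately failing examples):

import errno
import requests

# OS-level errors carry an errno value with a symbolic name
try:
    open("/this/path/does/not/exist")
except OSError as e:
    print(e.errno, errno.errorcode[e.errno], e.strerror)  # e.g. 2 ENOENT No such file or directory

# A requests exception typically does not set one, which is why e.errno is None
try:
    requests.get("https://nonexistent.invalid", timeout=1)
except requests.exceptions.RequestException as e:
    print(type(e).__name__, getattr(e, "errno", None))  # e.g. ConnectionError None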
