Set DNS timeout for HTTP requests using requests library


I have a function that is meant to check if a specific HTTP(S) URL is a redirect and, if so, return the new location (but not recursively). It uses the requests library and looks like this:

    try:
        response = http_session.head(sent_url, timeout=(1, 1))
        if response.is_redirect:
            return response.headers["location"]
        return sent_url
    except requests.exceptions.Timeout:
        return sent_url

Here, the URL I am checking is sent_url. For reference, this is how I create the session:

    http_session = requests.Session()
    http_adapter = requests.adapters.HTTPAdapter(max_retries=0)
    http_session.mount("http://", http_adapter)
    http_session.mount("https://", http_adapter)

However, one of the requirements of this program is that it must work for dead links. Because of this, I set a connection timeout (and a read timeout for good measure). After playing around with the values, it still takes about 5-10 seconds for the request to fail with this stack trace, no matter what value I choose. (Maybe relevant: in the browser, it gives DNS_PROBE_POSSIBLE.)

Now, my problem is that 5-10 seconds is too long to wait when a link is dead. There are many links that this program needs to check, and I do not want a few dead links to become such a large bottleneck, hence I want to configure this DNS lookup timeout.

I found this post, which seems relevant (the OP wants to increase the timeout, I want to decrease it), but the solution does not seem applicable: I do not know the IP addresses that these URLs point to. In addition, this feature request from years ago seems relevant, but it did not help me further.

So far, the best solution seems to be to spin up a coroutine for each link (or for a batch of links) and absorb the timeout asynchronously.

I am on Windows 10; however, this code will be deployed on an Ubuntu server. Both use Python 3.8.

So, how can I best give my HTTP requests a very low DNS resolution timeout for the case where they are fed a dead link?

CodePudding user response:

So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?

Separate the two steps.

Use urllib.parse to extract the hostname from the URL, and then use dnspython to resolve that name, with whatever timeout you want.

Then, and only if the resolution succeeds, fire up requests to grab the HTTP data.
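For illustration, here is a minimal sketch of that two-step approach built around the original snippet. It assumes dnspython 2.x is installed (its Resolver exposes resolve(), timeout and lifetime); the function name check_redirect and the 1-second values are placeholders, not part of the question:

    from urllib.parse import urlparse

    import dns.exception
    import dns.resolver
    import requests

    # Short, explicit DNS timeouts: 'timeout' is per nameserver,
    # 'lifetime' caps the total time spent on the whole lookup.
    resolver = dns.resolver.Resolver()
    resolver.timeout = 1
    resolver.lifetime = 1

    def check_redirect(http_session, sent_url):
        hostname = urlparse(sent_url).hostname
        if hostname is None:
            return sent_url
        try:
            # Fail fast if the name does not resolve within 'lifetime' seconds.
            resolver.resolve(hostname, "A")
        except dns.exception.DNSException:
            return sent_url
        try:
            response = http_session.head(sent_url, timeout=(1, 1))
            if response.is_redirect:
                return response.headers["location"]
            return sent_url
        except requests.exceptions.Timeout:
            return sent_url

Note that requests still performs its own lookup when the request is actually sent; the pre-check only exists to fail fast on names that do not resolve, and the second lookup will typically be answered from a local cache.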

@blurfus: in requests you can only pass the timeout parameter on the individual HTTP call; you can't attach it to a session. It is not spelled out explicitly in the documentation, but the code is quite clear on that.
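To illustrate the point (the URL is just an example):

    import requests

    session = requests.Session()
    # Supported: the timeout is passed per call as (connect timeout, read timeout).
    session.head("https://example.com", timeout=(1, 1))
    # Not supported: requests.Session() accepts no timeout argument, so there is
    # no built-in way to set a default timeout on the session itself.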

There are many links that this program needs to check,

That is in fact a completely separate problem, and it exists even if all links are fine; it is simply a problem of volume.

The typical solutions fall into two categories:

  • use asynchronous libraries (they exist for both DNS and HTTP), where your calls do not block: you get the data later, so you are able to do something else in the meantime
  • use multiprocessing or multithreading to parallelize things and have multiple URLs tested at the same time by separate instances of your code.

They are not completely mutually exclusive, and you can find a lot of pros and cons for each. Asynchronous code can be more complicated to write and to understand later, so multiprocessing/multithreading is often the first step for a "quick win" (especially if you do not need to share anything between the processes/threads, otherwise it quickly becomes a problem), yet handling everything asynchronously makes the code scale more nicely with the volume.
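As a concrete example of the multithreading route, here is a minimal sketch using only the standard library. It assumes a check_redirect(session, url) helper like the one sketched above; the URL list and the worker count are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholder list

    def check_one(url):
        # One Session per call keeps the example simple; for connection
        # pooling you could keep one session per worker thread instead.
        with requests.Session() as session:
            return check_redirect(session, url)

    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(check_one, urls))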
