How to choose the values for HttpClient retry policy and timeout policy?

I am trying to use a retry policy and a timeout policy that are applied to every HTTP call when the first call fails. Some parameters are read from configuration: retryCount, sleep and the timeout value.

services.AddHttpClient<Authentication>()
                .AddPolicyHandler((services, request) => HttpPolicyExtensions.HandleTransientHttpError()
                    .OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.BadGateway)
                    .OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.RequestTimeout)
                    .Or<TimeoutRejectedException>()
                    .WaitAndRetryAsync(retryCount, retryAttempt => TimeSpan.FromSeconds(Math.Pow(sleep, retryAttempt))))
                .AddPolicyHandler(HttpResponseMessageExtensions.GetTimeoutPolicy(DefaultTimeoutInMinutes));

Is there any preferred solution or any formula that can be used for the relationship between the timeout per retry, the timeout per client and/or the sleep value?

In my case the time taken for a failed call exceeds the timeout value when the retryCount has a big value and I receive this error message:

As far as I know the timeout per client is by default 100s and can be changed but what is the better option for choosing the values?

I also read something about a backoff mechanism but I am not sure how it works.

CodePudding user response：

I would suggest separating the policy definitions from the policy registrations.

Policy definitions

var retryPolicy = HttpPolicyExtensions.HandleTransientHttpError()
        .OrResult(msg => msg.StatusCode == HttpStatusCode.BadGateway)
        .OrResult(msg => msg.StatusCode == HttpStatusCode.RequestTimeout)
        .Or<TimeoutRejectedException>()
        .WaitAndRetryAsync(retryCount, retryAttempt => TimeSpan.FromSeconds(Math.Pow(sleep, retryAttempt)));

var timeoutPolicy = HttpResponseMessageExtensions
        .GetTimeoutPolicy(DefaultTimeoutInMinutes);

Policy registration

Local timeout

If you want a per-request (so-called local) timeout, where each attempt gets its own timeout, then retry must be the outer policy and you should chain them like this:

var strategy = Policy.WrapAsync(retryPolicy, timeoutPolicy);
services.AddHttpClient<Authentication>()
        .AddPolicyHandler(strategy);

Global timeout

If you want an overarching (so-called global) timeout that covers all retry attempts, including the sleep durations between them, then you should chain them like this:

var strategy = Policy.WrapAsync(timeoutPolicy, retryPolicy);
services.AddHttpClient<Authentication>()
        .AddPolicyHandler(strategy);

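In this scenario you don't need the Or<TimeoutRejectedException> builder method in the retryPolicy, because the timeout policy sits outside the retry policy, so the retry never sees the TimeoutRejectedException.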

Further suggestions

Combine `OrResult` clauses

// Requires "using System.Linq;" for Contains.
// Note: HandleTransientHttpError() already handles 5xx and 408 responses.
var statuses = new[] { HttpStatusCode.BadGateway, HttpStatusCode.RequestTimeout };
...
var retryPolicy = HttpPolicyExtensions.HandleTransientHttpError()
        .OrResult(msg => statuses.Contains(msg.StatusCode))
        .Or<TimeoutRejectedException>()
        .WaitAndRetryAsync(retryCount,
           retryAttempt => TimeSpan.FromSeconds(Math.Pow(sleep, retryAttempt)));

Make sure policies are compatible

Your retryPolicy is an IAsyncPolicy<HttpResponseMessage>. Make sure that the timeout policy is defined with the same result type:

IAsyncPolicy<HttpResponseMessage> timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(timeout);

If needed define both local and global timeouts

var strategy = Policy.WrapAsync(globalTimeoutPolicy, retryPolicy, localTimeoutPolicy);
services.AddHttpClient<Authentication>()
        .AddPolicyHandler(strategy);
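To get a feeling for how the local timeout, the retry count and the sleep value add up against a global timeout (or against HttpClient's default 100-second Timeout), the worst case can be estimated with a small helper. This is only a sketch: RetryBudget and WorstCase are hypothetical names, the numbers are illustrative, and it assumes the sleep formula from the question (Math.Pow(sleep, retryAttempt) seconds) and that every attempt runs into the local timeout.

```csharp
using System;

public static class RetryBudget
{
    // Worst case: every attempt hits the local timeout and every retry
    // sleeps Math.Pow(sleepBase, attempt) seconds before it starts.
    // Total attempts = 1 initial call + retryCount retries.
    public static TimeSpan WorstCase(int retryCount, double sleepBase, TimeSpan localTimeout)
    {
        double sleepSeconds = 0;
        for (int attempt = 1; attempt <= retryCount; attempt++)
        {
            sleepSeconds += Math.Pow(sleepBase, attempt);
        }

        double attemptSeconds = (retryCount + 1) * localTimeout.TotalSeconds;
        return TimeSpan.FromSeconds(sleepSeconds + attemptSeconds);
    }
}
```

For example, RetryBudget.WorstCase(5, 2, TimeSpan.FromSeconds(10)) gives 122 seconds (62 seconds of sleeping plus 6 attempts of 10 seconds each), which already exceeds the default 100 seconds. If the worst case is larger than the global timeout, the later retries may never run.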

UPDATE #1

What happens if I am not using the WrapAsync method for the two policies? Is there any risk?

If I understand your question correctly then you are interested in the differences between these two:

services.AddHttpClient<Authentication>()
        .AddPolicyHandler(retryPolicy)
        .AddPolicyHandler(timeoutPolicy);

services.AddHttpClient<Authentication>()
        .AddPolicyHandler(Policy.WrapAsync(retryPolicy, timeoutPolicy));

The AddPolicyHandler method registers a PolicyHttpMessageHandler, which is a DelegatingHandler.
- If you call it twice then you register two DelegatingHandlers, so the exception propagation between them is done by ASP.NET Core.
- If you use WrapAsync then the escalation remains inside the Polly domain, in a single DelegatingHandler.

CodePudding user response：

I decided to post another answer because:

- the previous one focused more on best practices (rather than answering the OP's question)
- the previous post would have become pretty lengthy if I had amended it

Is there any preferred solution or any formula that can be used for the relationship between the timeout per retry, the timeout per client and/or the sleep value?

The short answer is NO. As always, it depends.

I will try to list several factors that should be taken into account whenever you want to determine the actual values of retryCount, sleep and timeout.

Is the decorated functionality consumer facing?

If the functionality is directly consumer facing you should specify fairly low values. Define them so that any delay the consumer can observe stays acceptable.

Let's say at most 3 retries within a total timeout of 10 seconds might be okay (depending on the requirements). But at most 10 retries within a total timeout of 100 seconds is most probably not tolerable for any consumer. No one is willing to wait that long.

What is the 95th percentile response time of the downstream system?

In order to specify a correct timeout you need to know how the downstream system performs under normal and high load. If you set the timeout too low you might cut short a request that would otherwise succeed. If it is set too high then your application might wait for a response that never arrives.
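If you log the response times of the downstream system, the percentile can be computed from those samples. A minimal sketch using the nearest-rank method follows; ResponseTimeStats and Percentile are hypothetical names, and where the samples come from (logs, metrics) is left open.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ResponseTimeStats
{
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    public static TimeSpan Percentile(IEnumerable<TimeSpan> samples, int percentile)
    {
        if (percentile < 1 || percentile > 100)
            throw new ArgumentOutOfRangeException(nameof(percentile));

        TimeSpan[] sorted = samples.OrderBy(s => s).ToArray();
        if (sorted.Length == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));

        // Integer ceiling of percentile * n / 100 avoids floating-point rounding.
        int rank = (percentile * sorted.Length + 99) / 100;
        return sorted[rank - 1];
    }
}
```

A per-attempt timeout somewhat above ResponseTimeStats.Percentile(samples, 95) is one possible starting point; how much headroom to add depends on the factors discussed here.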

You should also know the usual unavailability time. Let's say the downstream service usually recovers from unreachability in ~10-15 seconds; then your resilience strategy should not aim lower (for example 8-10 seconds with retries and total timeout).

If these metrics are not available then you should start with a conservative setup (fairly low values) and log all the failed attempts. Then adjust them accordingly after you have read your logs. It might take several iterations to find the magic numbers. Please bear in mind that you should do this exercise for each downstream system separately.

What do you want to do with requests that still fail after all retry attempts?

If you can ignore the failed requests then set low values.

If you need to manually process them then try to be more liberal (set higher values) in order to try to minimize their number.

If you need them to eventually succeed then prefer WaitAndRetryForever over WaitAndRetry with a specific retryCount.

How many concurrent clients are running?

If you have a fairly low number of concurrent clients then the sleep duration could be fairly static. You don't need to do exponential backoff.

If you have a fairly large number of concurrent clients then you should consider using exponential backoff with jitter, so that the clients do not all hit the (self-)healing downstream system at the same time. Here you can find a simple example for that.
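To illustrate the idea independently of any linked example, here is a minimal sketch of exponential backoff with "full jitter": the upper bound doubles with each attempt and the actual delay is picked randomly below it. JitteredBackoff, Delay, baseSeconds and maxSeconds are hypothetical names chosen for this sketch, not part of Polly.

```csharp
using System;

public static class JitteredBackoff
{
    // Exponential cap: baseSeconds, 2 * baseSeconds, 4 * baseSeconds, ...
    // limited by maxSeconds. The returned delay is a random value in
    // [0, cap), so concurrent clients spread their retries out.
    // Note: a Random instance is not thread-safe; share it only if access
    // is synchronized (or use Random.Shared on .NET 6 and later).
    public static TimeSpan Delay(int retryAttempt, double baseSeconds, double maxSeconds, Random random)
    {
        if (retryAttempt < 1)
            throw new ArgumentOutOfRangeException(nameof(retryAttempt));

        double cap = Math.Min(maxSeconds, baseSeconds * Math.Pow(2, retryAttempt - 1));
        return TimeSpan.FromSeconds(random.NextDouble() * cap);
    }
}
```

It could be plugged into the retry policy as the sleep duration provider, for example retryAttempt => JitteredBackoff.Delay(retryAttempt, 1, 30, random). The Polly.Contrib.WaitAndRetry package also offers ready-made jittered backoff sequences (such as Backoff.DecorrelatedJitterBackoffV2) if you prefer not to write your own.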