I am trying to use a retry policy and a timeout policy that are applied to every HTTP call when the first call fails.
I have some parameters that are read from configuration: retryCount, sleep and the timeout value.
services.AddHttpClient<Authentication>()
    .AddPolicyHandler((services, request) => HttpPolicyExtensions.HandleTransientHttpError()
        .OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.BadGateway)
        .OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.RequestTimeout)
        .Or<TimeoutRejectedException>()
        .WaitAndRetryAsync(retryCount, retryAttempt => TimeSpan.FromSeconds(Math.Pow(sleep, retryAttempt))))
    .AddPolicyHandler(HttpResponseMessageExtensions.GetTimeoutPolicy(DefaultTimeoutInMinutes));
Is there any preferred solution or any formula that can be used for the relationship between the timeout per retry, the timeout per client and/or the sleep value?
In my case the time taken for a failed call exceeds the timeout value when retryCount is large, and I receive this error message:
As far as I know the timeout per client is 100s by default and can be changed, but what is the best way to choose the values?
I also read something about a backoff mechanism but I am not sure how it works.
CodePudding user response:
I would suggest separating the policy definitions from the policy registration.
Policy definitions
var retryPolicy = HandleTransientHttpError()
    .OrResult(msg => msg.StatusCode == HttpStatusCode.BadGateway)
    .OrResult(msg => msg.StatusCode == HttpStatusCode.RequestTimeout)
    .Or<TimeoutRejectedException>()
    .WaitAndRetryAsync(retryCount, retryAttempt => TimeSpan.FromSeconds(Math.Pow(sleep, retryAttempt)));
var timeoutPolicy = HttpResponseMessageExtensions
    .GetTimeoutPolicy(DefaultTimeoutInMinutes);
Policy registration
Local timeout
If you want a per-request (so-called local) timeout, which applies to each attempt separately, then you should chain them like this:
var strategy = Policy.WrapAsync(retryPolicy, timeoutPolicy);
services.AddHttpClient<Authentication>()
.AddPolicyHandler(strategy);
Global timeout
If you want an overarching (so-called global) timeout that covers all retry attempts, then you should chain them like this:
var strategy = Policy.WrapAsync(timeoutPolicy, retryPolicy);
services.AddHttpClient<Authentication>()
.AddPolicyHandler(strategy);
In this scenario you don't need the Or<TimeoutRejectedException> builder method in the retryPolicy.
Further suggestions
Combine OrResult clauses
var statuses = new[] { HttpStatusCode.BadGateway, HttpStatusCode.RequestTimeout };
...
var retryPolicy = HandleTransientHttpError()
.OrResult(msg => statuses.Contains(msg.StatusCode))
.Or<TimeoutRejectedException>()
.WaitAndRetryAsync(retryCount,
retryAttempt => TimeSpan.FromSeconds(Math.Pow(sleep, retryAttempt)));
Make sure policies are compatible
Your retryPolicy is an IAsyncPolicy<HttpResponseMessage> policy. Make sure that the timeout policy is defined the same way:
IAsyncPolicy<HttpResponseMessage> timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(timeout);
If needed define both local and global timeouts
var strategy = Policy.WrapAsync(globalTimeoutPolicy, retryPolicy, localTimeoutPolicy);
services.AddHttpClient<Authentication>()
.AddPolicyHandler(strategy);
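As a rough sizing rule for these timeouts (my own back-of-the-envelope estimate, not something Polly prescribes), assume every attempt runs into the local timeout $t_{\text{local}}$, there are $n$ = retryCount retries after the initial attempt, and the sleep before retry $i$ is $s^{i}$ seconds, as in the WaitAndRetryAsync call above. If every retry should get a chance to run, the global timeout has to be at least the worst-case duration, and HttpClient.Timeout (100s by default) should not be shorter than the global timeout, because it covers the whole call including retries:

```latex
T_{\text{worst}} = (n + 1)\, t_{\text{local}} + \sum_{i=1}^{n} s^{i},
\qquad
T_{\text{worst}} \le t_{\text{global}} \le t_{\text{HttpClient}}
```

For example, with $n = 3$, $s = 2$ and $t_{\text{local}} = 10$ seconds, the worst case is $4 \cdot 10 + (2 + 4 + 8) = 54$ seconds, which still fits under the default 100s client timeout.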
UPDATE #1
What happens if I am not using the WrapAsync method for the two policies? Is there any risk?
If I understand your question correctly then you are interested in the differences between these two:
services.AddHttpClient<Authentication>()
.AddPolicyHandler(retryPolicy)
.AddPolicyHandler(timeoutPolicy);
services.AddHttpClient<Authentication>()
.AddPolicyHandler(Policy.WrapAsync(retryPolicy, timeoutPolicy));
- The AddPolicyHandler method registers a PolicyHttpMessageHandler, which is a DelegatingHandler
  - If you call it twice then you register two DelegatingHandlers, so the exception propagation is done by ASP.NET Core
- If you use WrapAsync then the escalation remains inside the Polly domain, in a single DelegatingHandler
CodePudding user response:
I decided to post another answer because
- the previous one focused more on the best practices (rather than answering OP's question)
- the previous post would become pretty lengthy if I amended it
Is there any preferred solution or any formula that can be used for the relationship between the timeout per retry, the timeout per client and/or the sleep value?
The short answer is NO. As always, it depends.
I will try to list several factors that should be taken into account whenever you want to determine the actual values of retryCount, sleep and timeout.
Is the decorated functionality consumer facing?
If the functionality is directly consumer facing you should specify fairly low values. Define them so that any delay the consumer can observe stays acceptable.
Let's say at most 3 retries within a total timeout of 10 seconds might be okay (depending on the requirements). But at most 10 retries within a total timeout of 100 seconds is most probably not tolerable for any consumer. No one is willing to wait that long.
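To check whether a given configuration fits such a budget, the worst case can be estimated before deploying it. The following is only a sketch under stated assumptions: every attempt runs into a per-attempt timeout, and the sleep follows the Math.Pow(sleep, retryAttempt) pattern from the question. WorstCaseDuration and all concrete numbers are hypothetical, chosen for illustration.

```csharp
using System;

// Hypothetical helper: worst-case duration of one logical call when
// every attempt (1 initial + retryCount retries) hits the local timeout
// and the sleep before retry i is Math.Pow(sleepBase, i) seconds.
static TimeSpan WorstCaseDuration(int retryCount, double sleepBase, TimeSpan localTimeout)
{
    double attemptsSeconds = (retryCount + 1) * localTimeout.TotalSeconds;

    double sleepsSeconds = 0;
    for (int attempt = 1; attempt <= retryCount; attempt++)
    {
        sleepsSeconds += Math.Pow(sleepBase, attempt);
    }

    return TimeSpan.FromSeconds(attemptsSeconds + sleepsSeconds);
}

// Assumed example values, not recommendations.
var modest = WorstCaseDuration(retryCount: 3, sleepBase: 1, localTimeout: TimeSpan.FromSeconds(1.5));
var heavy = WorstCaseDuration(retryCount: 10, sleepBase: 2, localTimeout: TimeSpan.FromSeconds(5));

Console.WriteLine($"3 retries, fixed 1s sleep: {modest.TotalSeconds}s");         // 9s
Console.WriteLine($"10 retries, exponential base 2: {heavy.TotalSeconds}s");     // 2101s
```

The second configuration shows why a large retryCount combined with exponential sleep quickly exceeds any overarching timeout: the sleeps alone add up to 2046 seconds.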
What is the 95th percentile response time of the downstream system?
In order to specify a correct timeout you need to know how the downstream system performs under normal and high load. If you set the timeout too low you might cut short a request that would otherwise succeed. If it is set too high then your application might wait for a response that never arrives.
You should also know the usual unavailability time. Let's say the downstream service usually recovers from unreachability within ~10-15 seconds; then your resilience strategy should not aim lower (for example 8-10 seconds across retries and total timeout).
If these metrics are not available then you should start with a conservative setup (fairly low values) and log all the failed attempts. Then adjust them accordingly after you have read your logs. It might take several iterations to find the magic numbers. Please bear in mind that you should do this exercise for each downstream system separately.
What do you want to do with requests that still fail after all retry attempts?
If you can ignore the failed requests then set low values.
If you need to manually process them then try to be more liberal (set higher values) in order to try to minimize their number.
If you need them to eventually succeed then prefer WaitAndRetryForever over WaitAndRetry with a specific retryCount.
How many concurrent clients are running?
If you have a fairly low number of concurrent clients then the sleep duration could be fairly static. You don't need to do exponential backoff.
If you have a fairly large number of concurrent clients then you should consider using exponential backoff with jitter, so that the clients do not all hit the (self-)healing downstream system at the same time. Here you can find a simple example for that.
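The linked example is not reproduced here, so below is a minimal sketch of the idea instead: keep the exponential part from the question and add a small random offset, so that clients which failed at the same moment do not retry at the same moment. BackoffWithJitter is a hypothetical helper and the 0-1000 ms jitter window is an assumed value.

```csharp
using System;

// Hypothetical helper: exponential backoff (sleepBase ^ retryAttempt seconds)
// plus a random jitter of 0..999 ms. The jitter window is an assumption.
static TimeSpan BackoffWithJitter(int retryAttempt, double sleepBase, Random jitterer)
{
    TimeSpan exponential = TimeSpan.FromSeconds(Math.Pow(sleepBase, retryAttempt));
    TimeSpan jitter = TimeSpan.FromMilliseconds(jitterer.Next(0, 1000));
    return exponential + jitter;
}

// Note: a Random instance is not thread-safe; share it with care
// (on .NET 6+ Random.Shared is an option).
var jitterer = new Random();
for (int attempt = 1; attempt <= 3; attempt++)
{
    Console.WriteLine($"attempt {attempt}: wait {BackoffWithJitter(attempt, 2, jitterer)}");
}

// With Polly this would be passed as the sleep duration provider, e.g.
// .WaitAndRetryAsync(retryCount, attempt => BackoffWithJitter(attempt, sleep, jitterer))
```

Each client then waits 2, 4, 8, ... seconds plus up to a second of its own, which spreads the retries out in time.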