Retry Mechanisms and Exponential Backoff
Making services more resilient.
Retry
Retries are a core resiliency pattern that helps enhance service availability by re-attempting failed operations.
A retry is simply a repeated operation: when an operation fails, a retry attempts it again. Retries are usually combined with a “backoff strategy” that inserts a delay between attempts, to prevent the target resource from being overwhelmed.
Many classes of errors are transient, typically rooted in network issues or server overload, and resolve quickly on a retry. If the extra latency of retrying can be tolerated, retries let a service trade a small amount of latency for increased availability.
Transient Failures
Transient failures are failures that occur while communicating with an external service that is temporarily unavailable.
Once we have identified a fault as transient, we can use a retry mechanism so that the issue is resolved by calling the service again.
Let’s imagine a scenario where the transient fault is happening because the service is overloaded or some throttling is implemented at the service end. This service is rejecting new calls. This is a transient fault because if we call the service after some time, our call could possibly succeed. There could also be a possibility that our retry requests are further adding to the overload of the busy service, which would mean that the service will take longer to recover from this state.
Our requests are making the fault worse. How can we solve this problem? We use exponential backoff.
Exponential Backoff
Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate.
exponential backoff refers to an algorithm used to space out repeated retransmissions of the same block of data, often to avoid network congestion. ~ Wikipedia
In exponential backoff, after c collisions, a random number of slot times between 0 and 2^c − 1 is chosen. After the first collision, each sender will wait 0 or 1 slot times. After the second collision, the senders will wait anywhere from 0 to 3 slot times (inclusive). After the third collision, the senders will wait anywhere from 0 to 7 slot times (inclusive), and so forth.
Exponential backoff time: the expected backoff after c collisions is the midpoint of this window, E(c) = (2^c − 1) / 2 slot times.
If we use a fixed retry interval, all upstream leaf services will retry at the exact same rhythm. This could effectively mount a DoS (Denial of Service) attack against our own network. Using jitter, we intentionally introduce randomness into the retry rhythm, so that retry calls generated from upstream services are smoothly distributed over time.
Code:
Here is a complete example of retry with exponential backoff written in Go:
package main

import (
	"context"
	"fmt"
	"math"
	"net/http"
	"time"
)

// defaults
const (
	DefaultMaxRetries = 5
	DefaultDelay      = 2 * time.Second
	DefaultMaxDelay   = 10 * time.Second
)

type RetryClient struct {
	maxRetries int
	delay      time.Duration
	maxDelay   time.Duration
}

func NewRetryClient(maxRetries int, delay, maxDelay time.Duration) *RetryClient {
	// input sanitization
	if maxRetries <= 0 {
		maxRetries = DefaultMaxRetries
	}
	if delay <= 0 {
		delay = DefaultDelay
	}
	if maxDelay <= 0 {
		maxDelay = DefaultMaxDelay
	}
	return &RetryClient{maxRetries, delay, maxDelay}
}

func (r *RetryClient) Run(f func() error) error {
	return r.RunWithCtx(context.Background(), func(_ context.Context) error {
		return f()
	})
}

func (r *RetryClient) RunWithCtx(ctx context.Context, f func(context.Context) error) error {
	retryAttempts := 0
	for {
		// run f
		err := f(ctx)
		if err == nil {
			return nil
		}
		retryAttempts++
		fmt.Printf("Retry #%d\n", retryAttempts)
		if retryAttempts == r.maxRetries {
			fmt.Println("Max retries reached.")
			return err
		}
		// if FinalError, stop execution
		if final, ok := err.(FinalError); ok {
			return final.e
		}
		t := time.NewTimer(r.exponentialBackoff(retryAttempts))
		select {
		case <-t.C:
		case <-ctx.Done():
			// stop the timer, drain its channel if it already fired, and return
			if !t.Stop() {
				<-t.C
			}
			return err
		}
	}
}

// Stop wraps an error so that retrying halts immediately.
func Stop(err error) error {
	return FinalError{err}
}

type FinalError struct {
	e error
}

func (f FinalError) Error() string {
	return f.e.Error()
}

// exponentialBackoff returns the expected backoff of (2^c − 1)/2 slot times,
// where one slot is the base delay, capped at the configured maximum delay.
func (r *RetryClient) exponentialBackoff(retryAttempts int) time.Duration {
	slots := math.Floor((math.Pow(2, float64(retryAttempts)) - 1) * 0.5)
	backoff := time.Duration(slots) * r.delay
	if backoff > r.maxDelay {
		backoff = r.maxDelay
	}
	return backoff
}

func main() {
	// create a new client and run
	retryClient := NewRetryClient(10, 1*time.Millisecond, 10*time.Millisecond)
	err := retryClient.Run(func() error {
		resp, err := http.Get("https://httpstat.us/503")
		if err != nil {
			// return error from the request itself
			return err
		}
		defer resp.Body.Close()
		switch {
		case resp.StatusCode >= 500:
			return fmt.Errorf("Retryable HTTP status: %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode))
		case resp.StatusCode != http.StatusOK:
			// non-retryable error, call Stop()
			return Stop(fmt.Errorf("Non-retryable HTTP status: %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode)))
		}
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
We create a retry client and use it to run a function that fetches a page returning the 503 status code. The function checks whether the status code is retryable; if not, it calls Stop(), which is exposed by our retry logic.
The delay times are intentionally set low for testing purposes; feel free to adjust them.
The retry logic is an infinite loop that exits when the maximum number of retries is reached, or when it encounters a FinalError, which is produced by the Stop() function. It also uses a timer whose channel fires once the exponential backoff duration has elapsed.
Let’s do a GET request on a normal page (https://httpstat.us/200) in our function which is to be retried:
❯ go run main.go
❯
It runs successfully!
Running on a page with a 503 status code (https://httpstat.us/503):
❯ go run main.go
Retry #1
Retry #2
Retry #3
Retry #4
Retry #5
Retry #6
Retry #7
Retry #8
Retry #9
Retry #10
Max retries reached.
Retryable HTTP status: 503: Service Unavailable
Since the page returns 503 indefinitely, the retry mechanism comes into play and retries until it reaches the maximum number of retry attempts.
Running on a page with a 404 status code (https://httpstat.us/404):
❯ go run main.go
Retry #1
Non-retryable HTTP status: 404: Not Found
Since 404 is non-retryable in nature, the retry mechanism stops itself.
We have now successfully implemented a retry mechanism with exponential backoff in Go! I hope this post was insightful!
Here are some tips for retry mechanisms:
- If you are unsure what type of failure you anticipate and whether or not it is recoverable, the best option is to fail fast, not to retry at all.
- If you have a mission-critical operation, retry mechanisms alone will not solve your problem. It will require careful, full-fledged resilience engineering that covers retry mechanisms, circuit breakers, chaos engineering, monitoring, and more.