Retry Mechanisms and Exponential Backoff
Making services more resilient.
Retry
Retries are a core resiliency pattern that helps enhance service availability by re-attempting failed operations.
A retry is simply a repeated operation: when an operation fails, a retry attempts it again. Retries are usually combined with a “backoff strategy” that inserts a delay between attempts, to prevent the target resource from being overwhelmed.
Many classes of errors are transient, typically rooted in network issues or server overload, and resolve quickly on a retry. If the extra latency of retrying can be tolerated, retries let a service trade a small amount of latency for increased availability.
Transient Failures
Transient failures are failures that occur while communicating with an external service that is temporarily unavailable.
Once we have identified a fault as transient, we can use a retry mechanism so that the issue is resolved by calling the service again.
Let’s imagine a scenario where the transient fault is happening because the service is overloaded or some throttling is implemented at the service end. This service is rejecting new calls. This is a transient fault because if we call the service after some time, our call could possibly succeed. There could also be a possibility that our retry requests are further adding to the overload of the busy service, which would mean that the service will take longer to recover from this state.
Our requests are making the fault worse. How can we solve this problem? We use exponential backoff.
Exponential Backoff
Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate.
exponential backoff refers to an algorithm used to space out repeated retransmissions of the same block of data, often to avoid network congestion. ~ Wikipedia
In exponential backoff, after c collisions, a random number of slot times between 0 and 2^c − 1 is chosen. After the first collision, each sender will wait 0 or 1 slot times. After the second collision, the senders will wait anywhere from 0 to 3 slot times (inclusive). After the third collision, the senders will wait anywhere from 0 to 7 slot times (inclusive), and so forth.
Exponential backoff time: the expected backoff after c collisions is the midpoint of this window, E(c) = (2^c − 1) / 2 slot times.
If we use a fixed retry interval, all upstream leaf services will retry at the exact same rhythm. This could effectively mount a DoS (Denial of Service) attack against our own network. Using jitter, we intentionally introduce randomness into the retry rhythm, so that retry calls generated from upstream services are smoothly distributed over time.
Code:
Here is a complete example of retry with exponential backoff written in Go:
package main

import (
	"context"
	"fmt"
	"math"
	"net/http"
	"time"
)

// defaults
const (
	DefaultMaxRetries = 5
	DefaultDelay      = 2 * time.Second
	DefaultMaxDelay   = 10 * time.Second
)

type RetryClient struct {
	maxRetries int
	delay      time.Duration
	maxDelay   time.Duration
}

func NewRetryClient(maxRetries int, delay, maxDelay time.Duration) *RetryClient {
	// input sanitization
	if maxRetries <= 0 {
		maxRetries = DefaultMaxRetries
	}
	if delay <= 0 {
		delay = DefaultDelay
	}
	if maxDelay <= 0 {
		maxDelay = DefaultMaxDelay
	}
	return &RetryClient{maxRetries, delay, maxDelay}
}

func (r *RetryClient) Run(f func() error) error {
	return r.RunWithCtx(context.Background(), func(_ context.Context) error {
		return f()
	})
}

func (r *RetryClient) RunWithCtx(ctx context.Context, f func(context.Context) error) error {
	retryAttempts := 0
	for {
		// run f
		err := f(ctx)
		if err == nil {
			return nil
		}
		retryAttempts++
		fmt.Printf("Retry #%d\n", retryAttempts)
		if retryAttempts == r.maxRetries {
			fmt.Println("Max retries reached.")
			return err
		}
		// if FinalError, stop execution
		if final, ok := err.(FinalError); ok {
			return final.e
		}
		t := time.NewTimer(r.exponentialBackoff(retryAttempts))
		select {
		case <-t.C:
		case <-ctx.Done():
			// stop the timer, drain its channel if it already fired, and return
			if !t.Stop() {
				<-t.C
			}
			return err
		}
	}
}

// Stop wraps an error so that retrying halts immediately.
func Stop(err error) error {
	return FinalError{err}
}

type FinalError struct {
	e error
}

func (f FinalError) Error() string {
	return f.e.Error()
}

// exponentialBackoff returns the expected backoff of (2^c − 1)/2 slot times,
// where one slot is the base delay, capped at the configured maximum delay.
func (r *RetryClient) exponentialBackoff(retryAttempts int) time.Duration {
	slots := math.Floor((math.Pow(2, float64(retryAttempts)) - 1) * 0.5)
	backoff := time.Duration(slots) * r.delay
	if backoff > r.maxDelay {
		backoff = r.maxDelay
	}
	return backoff
}

func main() {
	// create a new client and run
	retryClient := NewRetryClient(10, 1*time.Millisecond, 10*time.Millisecond)
	err := retryClient.Run(func() error {
		resp, err := http.Get("https://httpstat.us/503")
		if err != nil {
			// return error from the request itself
			return err
		}
		defer resp.Body.Close()
		switch {
		case resp.StatusCode >= 500:
			return fmt.Errorf("Retryable HTTP status: %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode))
		case resp.StatusCode != http.StatusOK:
			// non-retryable error, call Stop()
			return Stop(fmt.Errorf("Non-retryable HTTP status: %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode)))
		}
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
We create a retry client and use it to run a function that fetches a page returning the 503 status code. The function checks whether the status code is retryable; if not, it calls Stop(), which is exposed by our retry logic.
The delay times are intentionally set low for testing purposes; feel free to adjust them.
The retry logic is an infinite loop that exits when the maximum number of retries is reached, or when it encounters a FinalError, which is produced by the Stop() function. It also uses a timer whose channel fires once the exponential backoff duration has elapsed.
Let’s do a GET request on a normal page (https://httpstat.us/200) in our function which is to be retried:
❯ go run main.go
❯
It runs successfully!
Running on a page with a 503 status code (https://httpstat.us/503):
❯ go run main.go
Retry #1
Retry #2
Retry #3
Retry #4
Retry #5
Retry #6
Retry #7
Retry #8
Retry #9
Retry #10
Max retries reached.
Retryable HTTP status: 503: Service Unavailable
Since the page returns 503 indefinitely, the retry mechanism comes into play and retries until it reaches the maximum number of retry attempts.
Running on a page with a 404 status code (https://httpstat.us/404):
❯ go run main.go
Retry #1
Non-retryable HTTP status: 404: Not Found
Since 404 is non-retryable in nature, the retry mechanism stops itself.
We have now successfully implemented a retry mechanism with exponential backoff in Go! I hope this post was insightful!
Here are some tips for retry mechanisms:
- If you are unsure what type of failure you anticipate and whether or not it is recoverable, the best option is to fail fast, not to retry at all.
- If you have a mission-critical operation, retry mechanisms alone will not solve your problem. It will require careful, full-fledged resilience engineering that covers retry mechanisms, circuit breakers, chaos engineering, monitoring, and more.