Resilience in a cloud architecture

A typical cloud architecture - from a cloud native solution to a public cloud solution - consists of multiple connected services. Some of these services are your own, others are third-party services. What they have in common is that they all communicate with each other over a network and depend on each other's availability.

When your application integrates with another service, it must handle erroneous behavior from that service. In my experience, resilience is something we developers often forget, and it is first addressed once we experience availability issues with the service in production.

Timeout

When your application calls an external service, the service may not always respond. If this call is a synchronous call made as part of an HTTP request to your application, the response will be delayed until the call to the external service times out. If your application continues to receive requests that cause calls to this external service, the long-lasting requests can begin to pile up. The web server running your application has a limited number of threads for handling requests, and when all of them are in use, the web server will begin dropping new requests. Being aware of the timeout set on the client contacting this service can be crucial for keeping your application fully operational.

The client you are using to call the external service most likely has a default timeout value set, but you cannot rely on that value being right for your application. Especially in generic HTTP clients, default timeouts are usually set very high. For example, HttpClient found in .NET has a default timeout of 100 seconds. Knowing what to expect from the service you are calling and the context of your application, you should adjust the timeout to fit your needs. When using client libraries made for calling a service provided by e.g. a public cloud provider, you can usually expect more sensible default timeouts. Even so, it is still a good idea to get familiar with the timeout the client is using. The service provider does not know how you are using their service, and in some cases the timeouts set in the client library do not fit the context and nature of your application.

Changing the timeout on a client is usually straightforward, but not in all cases. For those cases, there exist resilience libraries that can wrap client calls and provide a configurable timeout. Polly is a popular resilience library for .NET and has a Timeout policy that can be used for this. In resilience4j, a go-to resilience library for Java, resilience4j-timelimiter can be used.
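To illustrate what wrapping a client call with a configurable timeout can look like, here is a minimal Python sketch using only the standard library. It is not how Polly or resilience4j are implemented; the names call_with_timeout, CallTimedOut and slow_service_call are made up for this example, and the pool size is an arbitrary assumption.

```python
import concurrent.futures
import time

# Shared worker pool; the size is an arbitrary choice for this sketch.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class CallTimedOut(Exception):
    """Raised when the wrapped call does not finish within the time limit."""


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_seconds.

    The worker thread is not interrupted. The caller merely stops waiting,
    so an abandoned call still occupies a worker until it finishes.
    """
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        raise CallTimedOut(f"no response within {timeout_seconds} seconds") from None


def slow_service_call():
    """Hypothetical stand-in for a client call that responds too slowly."""
    time.sleep(0.5)
    return "response"
```

Calling call_with_timeout(slow_service_call, 0.05) raises CallTimedOut after roughly 50 milliseconds instead of blocking the request thread for the full half second. Because the abandoned call keeps running in the background, setting the timeout on the client itself is preferable whenever the client supports it.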

Automatic retry

Sometimes, a service is unavailable for a short time. For example, a node hosting the service goes down while the load balancer in front is still sending some traffic to it, or a network issue causes a few packets to be dropped. These kinds of transient errors are very common in a cloud architecture. Having set a correct timeout towards this service will cause our application to abort quickly when the service is not responding, but we will still have to respond with an error. This is annoying, especially if the service overall is available - it was just our call to it that failed.

Before assuming that a service is unavailable when we get a transient error from it, we can retry to see if we get a successful response. This type of functionality can be built into the client that is making the calls to the service and performed automatically on certain erroneous behavior. This is often referred to as automatic retry.

When applying automatic retry, many different retry strategies can be used. The simplest strategies are to either immediately retry or wait some predefined time before retrying - often with multiple attempts. But usually when a service responds with an error you do not know how long this behavior will last. You do not want to retry without giving the service time to come up again, but you do not want to wait longer than necessary either. A popular strategy is exponential back-off, where you begin with a very short wait period, and increase it exponentially before each retry attempt. Choosing the correct retry strategy depends on the use case of your application. If the call to the service is a synchronous one, you probably cannot afford many retry attempts and long wait periods. If it is asynchronous, something like exponential back-off can greatly increase the likelihood of the call being successful.
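As a rough illustration of exponential back-off, the Python sketch below retries a call and multiplies the wait by a fixed factor before each new attempt. The function names, the default values and the choice to retry only on ConnectionError and TimeoutError are assumptions made for this example, not settings taken from any particular library.

```python
import time


def backoff_delays(base_seconds, factor, max_attempts, max_delay_seconds):
    """Return the wait before each retry: base, base*factor, base*factor^2, ...

    There is one delay fewer than attempts, and each delay is capped at
    max_delay_seconds.
    """
    return [
        min(base_seconds * factor ** attempt, max_delay_seconds)
        for attempt in range(max_attempts - 1)
    ]


def retry_with_backoff(func, max_attempts=4, base_seconds=0.1, factor=2.0,
                       max_delay_seconds=5.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call func until it succeeds or max_attempts calls have been made."""
    delays = backoff_delays(base_seconds, factor, max_attempts, max_delay_seconds)
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

With the defaults above, the waits are 0.1, 0.2 and 0.4 seconds, which could suit a synchronous call. For an asynchronous job, a larger max_attempts and max_delay_seconds would give the service more time to recover. The sleep parameter exists only so that the waiting can be replaced in tests.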

A common mistake is to retry errors that are not transient communication errors. Retrying errors like application exceptions makes little sense, as these errors usually do not just go away. Also, not all non-successful HTTP response codes should be treated as transient errors. For example, a service call returning a 404 will most likely continue to return 404, and retrying it will be useless.
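One way to express this distinction in code is a small predicate that decides whether a response is worth retrying. The set of status codes below is an assumption made for this sketch. Which codes are actually transient depends on the service you are calling, so treat the list as a starting point rather than a rule.

```python
# Assumption for this sketch: these statuses usually signal a temporary
# condition on the server or on the path to it.
TRANSIENT_STATUS_CODES = frozenset({
    408,  # Request Timeout
    429,  # Too Many Requests
    502,  # Bad Gateway
    503,  # Service Unavailable
    504,  # Gateway Timeout
})


def should_retry(status_code):
    """Return True only for responses that may succeed on a later attempt."""
    return status_code in TRANSIENT_STATUS_CODES
```

Here should_retry(503) is True, while should_retry(404) and should_retry(400) are False, so a missing resource or a malformed request fails right away instead of being retried to no effect.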

Dedicated client libraries often have automatic retry built in, and the parameters are usually configurable - at least for the more basic retry strategies. If you find that the library you are using cannot be configured the way you want, it can easily be wrapped by a resilience library to achieve the retry strategy you need. The same libraries can be used to wrap generic HTTP clients with a retry strategy as well. Polly has the Retry and WaitAndRetry policies and resilience4j has resilience4j-retry, both of which allow you to configure advanced retry strategies.

Circuit breakers

Automatic retry is good for handling transient errors from a service, but if the service is unavailable for a longer time, retrying it repeatedly will not help. In these situations, the focus should be on making sure your application handles the fact that the service is unavailable. One step towards this is to prevent your application from continuing to call a service that is down. This lets your application respond immediately when a service it depends on is down, because it does not need to wait for a failing call. A much-used resilience pattern for this is the circuit breaker pattern.

I will not go into detail on circuit breakers, but in short, a circuit breaker allows calls to the service as long as the service is healthy. If calls begin to fail, it will start bypassing calls to the service and instead just fail immediately. After bypassing calls for a while, it will let a single request through to see if the service is back up. If that call succeeds, subsequent calls will also be allowed through; if it still fails, the circuit breaker will continue to bypass calls. For a detailed explanation of the circuit breaker pattern, I recommend Martin Fowler's article on it.
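The behavior described above can be sketched as a small state machine. The Python class below is a simplified, single-threaded illustration and not the implementation found in Polly or resilience4j. The class name, the failure threshold, the reset timeout and the state names "closed", "open" and "half-open" are choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the service."""


class CircuitBreaker:
    """Simplified circuit breaker; not thread-safe."""

    def __init__(self, failure_threshold=3, reset_timeout_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self._clock = clock
        self._failures = 0      # consecutive failures seen so far
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_seconds:
            return "half-open"  # time to let a trial call through
        return "open"

    def call(self, func):
        if self.state == "open":
            raise CircuitOpenError("service assumed down, failing fast")
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        # Success: forget earlier failures and close the circuit.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self):
        was_half_open = self.state == "half-open"
        self._failures += 1
        if was_half_open or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the open period
```

While the circuit is open, call raises CircuitOpenError without invoking func at all, which is what lets the application respond immediately. A production-grade breaker would also need to be thread-safe and to limit how many trial calls run concurrently in the half-open state, which this sketch leaves out.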

In comparison to automatic retry, it is rarer for a service's client library to have a circuit breaker built into it, or at least one that is configurable. But you have probably already guessed that you can use a resilience library to achieve this. Both Polly and resilience4j have highly configurable circuit breakers with advanced features.

When using circuit breakers, it is recommended to monitor the state of each of them. Knowing if it is the circuit breaker that is blocking calls can greatly help in debugging situations. Adding monitoring of your circuit breakers into your existing monitoring system is usually quite easy with the circuit breakers found in resilience frameworks. For this reason, even if the client library has a built-in circuit breaker, it could also be a good idea to use the circuit breaker in your resilience framework instead so that you can monitor all your circuit breakers in the same way.
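As a small sketch of what such monitoring could look like, the Python class below polls a set of circuit breakers and logs every state change. It assumes that each breaker exposes a state attribute, which is a convention invented for this example. Real resilience libraries offer their own hooks for this, so check the documentation of the one you use.

```python
import logging

logger = logging.getLogger("circuit_breakers")


class BreakerStateReporter:
    """Logs a line whenever a registered breaker reports a new state."""

    def __init__(self):
        self._breakers = {}   # name -> object exposing a `state` attribute
        self._last_seen = {}  # name -> state observed at the previous poll

    def register(self, name, breaker):
        self._breakers[name] = breaker
        self._last_seen[name] = breaker.state

    def poll(self):
        """Return (name, old_state, new_state) for every change since the last poll."""
        changes = []
        for name, breaker in self._breakers.items():
            current = breaker.state
            previous = self._last_seen[name]
            if current != previous:
                logger.warning("circuit breaker %s: %s -> %s",
                               name, previous, current)
                changes.append((name, previous, current))
                self._last_seen[name] = current
        return changes
```

Calling poll on a schedule and forwarding the returned changes to your existing monitoring system gives one place to see which breakers are currently blocking calls.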

Summary

As communicating with external services is a big part of a typical cloud architecture, we have seen why thinking about resilience is important for achieving a robust cloud solution. Setting a timeout, performing automatic retry and applying a circuit breaker can increase the resilience of your application in different scenarios. Other techniques that we have not discussed here can be applied as well, like bulkhead isolation, fallback strategies and caching. Make sure to check them out in your preferred resilience library if you find any of them interesting. And finally, if you would like to dive deeper into building resilient software, make sure to check out Michael T. Nygard's book Release It!
