Hystrix in monolithic architecture

Zbigniew Artemiuk

Hystrix is a fairly mature library for isolating remote operations in distributed systems. Developers usually consider it only when working in a “pure” microservice architecture. Is it worth a look even when our project is “just” connecting to one or two external systems?

I think yes, but let’s go through the benefits of Hystrix that matter when your project connects to external systems.

Fallbacks

When connecting to external systems we usually do not think about what should happen when the remote system is down. We tend to be optimistic and assume (which in 99% of cases is true) that the system will respond quickly and without errors. More experienced developers will handle the most predictable errors, log them and maybe inform the user that the operation has failed. What changes if we start using Hystrix?

For a start, we are encouraged (or maybe even forced) to think about what should be done in case of errors, because the basic configuration of Hystrix consists of defining a fallback for a given business operation. Let’s imagine we are designing a service for managing our books (like the well-known goodreads.com). For each book we display, we would like to load its average price from an external system. In code it can look like this:


public class BookPriceService {
   BookPrice fetchPriceFor(BookId bookId) { ... }
}

If we’re using Spring and a library for integrating Spring with Hystrix (Hystrix javanica) we can easily change this code to support fallback in case fetching fails. We add one annotation and a fallback function.


public class BookPriceService {
   @HystrixCommand(fallbackMethod = "undefinedPrice")
   BookPrice fetchPriceFor(BookId bookId) { ... }

   // javanica expects the fallback to take the same parameters as the command method
   BookPrice undefinedPrice(BookId bookId) {
      return BookPrice.undefined();
   }
}

Now if our service fails (to be exact, if the fetchPriceFor method throws an exception), we get a BookPrice with an undefined value (returned by the static method BookPrice.undefined()). All that is left is to handle this value in the front end and show the user a proper message.

Such fallbacks can be created for many other services, especially those that fetch non-crucial information. I can imagine a GetMovieRatingService which returns 0 or an undefined rating in case of an error, or a UserPermissionService which falls back to read-only permissions on failure.
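To make the idea concrete without any library, here is a minimal plain-Java sketch of what a fallback boils down to. FallbackSketch and withFallback are hypothetical names of mine, not part of Hystrix, and Hystrix itself does much more around the call (metrics, isolation, timeouts).

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Hystrix: runs the primary call and
// switches to the fallback when the primary throws.
final class FallbackSketch {

    static <T> T withFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // A real setup would at least log the failure before falling back.
            return fallback.get();
        }
    }
}
```

A UserPermissionService could use it along the lines of withFallback(() -> remote.permissionsFor(user), () -> readOnlyPermissions()), where both lambdas are placeholders for your own code.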

Timeouts

Dealing with an external system becomes very frustrating when remote calls start to lag. Usually we are not prepared for that, and we propagate the lag to our system or even to our end users. Think, for example, about a part of the system where the user fills in a form. After the form is submitted, the server extracts the information from it and sends an e-mail through an SMTP server. Until the e-mail is sent, the form shows a spinner indicating that something is in progress.

What happens if the SMTP server starts to respond very slowly? If we perform all actions synchronously, the user will see the spinner for as long as it takes the SMTP server to respond. Maybe it is worth cancelling the call to the external system when it takes too long and trying again. Perhaps only this one request hit a problem and would make us wait forever, while a second request would go through smoothly.

If we implement it this way, the user will of course need to redo the operation, but re-executing it may well be faster than waiting endlessly for one “lost” request.

Hystrix will help you configure such behaviour in your system with no pain. It is once again just one annotation.


public class MailService {
   @HystrixCommand(
      fallbackMethod = "sendingFailed",
      commandProperties = {
         @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
      }
   )
   public SentInformation send(EmailBody emailBody) { … }

   // the fallback takes the same parameters as the command method
   public SentInformation sendingFailed(EmailBody emailBody) {
      return SentInformation.failed();
   }
}

With this setup, any call to the send method that takes longer than 3 seconds is cancelled and the fallback method is executed instead.
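If you wonder what such a timeout looks like without the annotation, the following is a rough plain-Java sketch of the same idea: run the call on another thread, wait for a limited time, then cancel and fall back. TimeoutSketch and callWithTimeout are names I made up for illustration; this is not how Hystrix is implemented internally.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not part of Hystrix.
final class TimeoutSketch {

    // Daemon threads, so a stuck remote call never keeps the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "remote-call");
        thread.setDaemon(true);
        return thread;
    });

    static <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis, Supplier<T> fallback) {
        Future<T> future = POOL.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the call reacts to interruption.
            future.cancel(true);
            return fallback.get();
        } catch (ExecutionException e) {
            // The remote call itself threw an exception.
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }
}
```

The caller gets the fallback value as soon as the time limit passes, even if the worker thread is still busy with the slow call.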

Thread pool separation

There is one more problem with a slow external system: running out of threads. Usually we call the external system from the same thread that executes our business logic, a thread from our application server. What happens when more and more threads execute remote calls that last forever? All of those threads hang on the call, and we keep consuming more. In the worst case we end up with no threads left to handle any additional connection to the server, because all of them are waiting for the external system.

It would be terrifying if an external system that is involved in only part of our system’s functionality could bring down our whole project.

Once again Hystrix helps us avoid such a situation at almost zero cost. By default, when you configure Hystrix as shown in the previous examples, Hystrix creates an additional thread pool, separate from the default pool of the application server. Of course you can tune this thread pool by changing its size, its queue size and many other settings (all described in the Hystrix configuration documentation).

Now if all threads in the Hystrix pool are busy, you can for example reject further calls or queue a couple of them. In general you can tune it the way you want, so that you neither reject too many requests nor hang on execution for too long.
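The mechanics of "run a few, queue a couple, reject the rest" can be sketched with a plain ThreadPoolExecutor. This is only an illustration under my own assumptions: BulkheadSketch and trySubmit are hypothetical names, and Hystrix manages its pools in its own way.

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hystrix: a dedicated, bounded pool
// for calls to one external system.
final class BulkheadSketch {

    private final ThreadPoolExecutor pool;

    // queueSize must be at least 1 because ArrayBlockingQueue needs a positive capacity.
    BulkheadSketch(int threads, int queueSize) {
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Returns the Future of the accepted call, or empty when all threads
    // are busy and the queue is full.
    <T> Optional<Future<T>> trySubmit(Callable<T> remoteCall) {
        try {
            return Optional.of(pool.submit(remoteCall));
        } catch (RejectedExecutionException e) {
            return Optional.empty();
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With new BulkheadSketch(1, 1), the first slow call runs, the second waits in the queue and the third is rejected right away, while the threads of the application server stay free for other work.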

What’s more, you are not limited to one thread pool. For instance, if you connect to two external systems, you can configure a different thread pool for each of them. Even with a single system, you can set up a separate thread pool for its long-running remote calls.

Configuring many thread pools is of course not free. Keep in the back of your mind that it increases context switching and the load on your machine.

Circuit breaker

The last thing from Hystrix I would like to mention is the circuit breaker pattern. In short, Hystrix collects statistics for each call to the remote system. If failures rise above some threshold, subsequent calls are automatically rejected by Hystrix without calling the external system (Hystrix marks the external system as “down”). Of course not all requests are rejected – from time to time Hystrix lets one request through to check whether the system is back up.

If it is not, subsequent requests are again rejected without calling the external system until the time for the next test request comes.

If the test request succeeds, the previous statistics are cleared and we go back to the initial state.

What advantages does this solution have? First, we do not send more calls to the external system when it looks like it has real problems responding quickly. Thanks to this, it gets a chance to recover and return to its normal state. Second, we don’t have to wait for timeouts to discover that the external system is down: if Hystrix is in the “rejection” state, then the remote call is rejected immediately (fail fast).
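The state changes described above fit into a small plain-Java sketch. It is deliberately simplified and rests on my own assumptions: it counts consecutive failures instead of keeping the kind of statistics Hystrix does, it holds a lock for the duration of the call, and CircuitBreakerSketch with its parameters is a hypothetical name, not Hystrix API.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker; not the Hystrix implementation.
final class CircuitBreakerSketch {

    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // injectable, so tests do not need to sleep

    private int consecutiveFailures = 0;
    private long openedAt = -1L; // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        boolean open = openedAt >= 0;
        if (open && clock.getAsLong() - openedAt < sleepWindowMillis) {
            // Open: fail fast without touching the external system.
            return fallback.get();
        }
        // Closed, or open with the sleep window over: this call is the test request.
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            openedAt = -1L; // success: back to the initial state
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // (re)open and restart the sleep window
            }
            return fallback.get();
        }
    }
}
```

In production code the clock would typically be System::currentTimeMillis, and the remote call would not run under a single lock.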

And far more

I hope I have encouraged you to look into Hystrix in the context of your project. Of course this is just the tip of the iceberg. You can find far more in Hystrix, like request collapsing, caching, metrics and monitoring. All of this is covered in the Hystrix documentation.
