Building Resilient Platform, Part 1

Published in

codeburst

4 min read · Jun 22, 2017

The importance of fault tolerance

The majority of businesses have gone online and are increasingly moving to public clouds, entrusting their operations to cloud providers. These providers meet the reliability needs of businesses with tools for automatic monitoring and deployment, but it is becoming increasingly evident that cloud providers are not the only ones that need to be reliable: application frameworks need to be reliable and intelligent too, with self-healing capabilities.

In node.js, for example, our platform already provides capabilities such as gracefully restarting a worker while keeping the computation unit generally available when an out-of-memory event occurs or when uncaught errors put the system into an unpredictable state. We view these capabilities as self-healing features of the platform that increase resiliency and allow the system to function in an environment where bugs are not only possible, but expected.

This has worked well for us, but the job is not finished. In a distributed environment with hundreds of services, the failure of a single component can start a chain reaction that affects the whole system, making it slow or causing it to fail completely. This becomes even more probable when each service is maintained by a different team, which adds a factor of unpredictability to the whole system. For more details on the importance of fault tolerance, please read the excellent blog post Fault Tolerance in a High Volume, Distributed System by Ben Christensen.

In this article I would like to discuss the circuit breaker pattern, how we were already using this pattern to detect and mitigate failures in our service call pipelines, and how we are expanding its use to the whole application.

Let’s assume we have the following simplified version of an application server with a single service call, which you can multiply in your mind to the desired complexity.

Here we have an imaginary web or service app with a request pipeline (for example, an express middleware chain) for incoming traffic. Upon a request, the controller initiates a service call that goes through a service client pipeline with a similar set of handlers, which format the request and parse the response on the way back. Once the response from the service is received, the controller formats it into an http response (html/json) and passes it back through the response pipeline to the browser or some other RESTful client.

This system is vulnerable to the problems described above, which can affect both downstream and upstream components.

To prevent this we need an element of fault tolerance to be integrated into the system.

What is out there?

The gold standard for fault tolerance and resiliency in a distributed environment was popularized by Netflix's successful rollout of the Hystrix circuit breaker, together with the fallback, fail fast, and fail silent patterns, in its complex distributed system. The instrumentation of service calls, the accumulation and calculation of metrics, and the tuning tools with a great dashboard add a lot of value to the Hystrix open-source framework. It is very popular in the Java community, as it was written in Java, but similar implementations are starting to appear for other platforms such as node.js (https://npmjs.org/package/hystrixjs).

After adding a hystrix circuit breaker, our diagram looks as follows:

Now our pipeline to the services in the cloud is guarded: in times of stress it prevents unnecessary calls to the downstream services, relieving pressure on them and diverting traffic to a fallback route.
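To make the guarding behavior concrete, here is a minimal, self-contained sketch of a circuit breaker with a fallback. It is an illustration of the pattern only, not the Hystrix, hystrixjs, or trooba-hystrix-handler API: the `CircuitBreaker` class, its option names (`failureThreshold`, `resetTimeoutMs`, `now`), and the consecutive-failure counting are all assumptions made for this example. Hystrix itself decides when to open the circuit from error percentages over a rolling window, which this sketch simplifies away.

```javascript
// Illustrative circuit breaker; names and options are hypothetical.
class CircuitBreaker {
  constructor(action, { fallback, failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.action = action;                 // async function that calls the downstream service
    this.fallback = fallback;             // optional function used when the call fails or is skipped
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;                       // injectable clock, handy for tests
    this.failures = 0;                    // consecutive failures seen so far
    this.openedAt = null;                 // null means the circuit is closed
  }

  // The circuit is open until resetTimeoutMs has elapsed since it tripped.
  isOpen() {
    return this.openedAt !== null && this.now() - this.openedAt < this.resetTimeoutMs;
  }

  async execute(...args) {
    if (this.isOpen()) {
      // Fail fast: do not touch the downstream service at all.
      return this.runFallback(new Error('circuit open'), args);
    }
    try {
      // Closed circuit, or a trial call after the reset timeout has passed.
      const result = await this.action(...args);
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now();       // trip (or re-trip) the circuit
      }
      return this.runFallback(err, args);
    }
  }

  runFallback(err, args) {
    if (typeof this.fallback === 'function') {
      return this.fallback(err, ...args);
    }
    throw err;                            // no fallback: surface the error to the caller
  }
}
```

Once the threshold is reached, calls return the fallback immediately without invoking `action`, which is what relieves pressure on a struggling downstream service. After the reset timeout a single trial call is let through, and its outcome decides whether the circuit closes again.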

If you are a Java developer, you can use the hystrix component right out of the box.

If you are a node.js developer, you can try our open source trooba pipeline framework, which lets you build service client pipelines out of a set of handlers, similar to how one builds an expressjs-based application using the middleware pattern. One of those handlers is trooba-hystrix-handler, which adds hystrix functionality to service calls.

Here’s how we do it:

Here we are building a simple http client that uses just two handlers, executed in the order they were added. The first one is the hystrix handler, where you can provide an optional fallback, and the second one is the actual http transport that makes the http call to www.ebay.com. The hystrix handler will create a command ‘my-service-command’.
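The handler-ordering idea can also be sketched without any framework. The following is a minimal, self-contained illustration and does not use trooba's actual API: `buildPipeline`, `guardHandler`, and `stubTransport` are hypothetical names invented for this example, and the transport is a stub that answers from an in-memory map instead of making a real http call.

```javascript
// Hypothetical helper (not trooba): runs handlers in the order they were given.
function buildPipeline(handlers) {
  return function run(request) {
    const dispatch = (index, req) => {
      if (index >= handlers.length) {
        return Promise.reject(new Error('no handler produced a response'));
      }
      // Each handler gets the request plus a `next` function that hands
      // control to the handler added after it.
      const next = (nextReq = req) => dispatch(index + 1, nextReq);
      return Promise.resolve().then(() => handlers[index](req, next));
    };
    return dispatch(0, request);
  };
}

// First handler: guards everything after it and applies an optional fallback.
function guardHandler({ fallback } = {}) {
  return async (request, next) => {
    try {
      return await next();
    } catch (err) {
      if (typeof fallback === 'function') {
        return fallback(err, request);
      }
      throw err;
    }
  };
}

// Second handler: a stand-in transport that answers from a Map of path -> body.
function stubTransport(responses) {
  return async (request) => {
    if (!responses.has(request.path)) {
      throw new Error('no route for ' + request.path);
    }
    return { status: 200, body: responses.get(request.path) };
  };
}

// The guard is added first, so it wraps the transport that follows it.
const client = buildPipeline([
  guardHandler({ fallback: () => ({ status: 503, body: 'fallback' }) }),
  stubTransport(new Map([['/', 'hello']])),
]);
```

Calling `client({ path: '/' })` resolves with the transport's response, while a request for an unknown path makes the transport throw and the guard answer with the fallback. Putting the guard first is the design choice that matters: only a handler that sits in front of the transport can intercept its failures.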

As a bonus, you can also integrate the hystrix dashboard module into a node.js app to get access to the really nice metrics below, which you can host right out of your application or export to an external hystrix dashboard.

Here’s how you can do it:

This will expose the hystrix dashboard at http://localhost:8000/hystrix, and you can start monitoring. To make it short, here’s the link you will end up at to view the metrics.

But the problem still remains

The general solution of wrapping every service call in a circuit breaker proved to be a great way to be nice to downstream services and increased the resiliency of the whole distributed system, but it is far from complete.

Where the rest of the problem lies, and how to solve it, is what we are going to explore in Part 2 of our series.

Written by Dmytro Semenov
