Mikro On Software: Chaos Engineering


Saturday, January 24, 2026

Increase resilience: Decouple external system calls

For most software companies, resilience means that the replication factor is 2 or higher.

Well, that's far from good. Your resilience is still bad with two instances of a weak micro service.

One way to increase the internal resilience of your service is to decouple external service calls. This includes database operations, Kafka and HTTP. For HTTP calls this is very common, and most developers understand that HTTP calls can go wrong. The same holds for database calls: all external services run over a network with firewalls, load balancers, switches and so on. So it's a good idea to make all external calls more resilient:

Put the external calls into their own threads. This keeps your application running in case of an error or timeout.
Put a timeout or watchdog on each call. By the way, Java futures are an easy way to do this.
Check the results, also for write operations.
Extend logs and monitoring so that errors in external calls are recognized.
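
The points above could be sketched in Java roughly as follows. This is only a minimal illustration, not a finished implementation: the class DecoupledCall, the method callWithTimeout and the fallback parameter are hypothetical names chosen for this example. The external call runs in its own thread, a timeout guards it, and every failure is logged before a fallback value is returned.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for this example, not part of any library.
final class DecoupledCall {

    // Separate daemon threads, so a hanging external call cannot block the caller's thread.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "external-call");
        thread.setDaemon(true);
        return thread;
    });

    // Runs the external call in its own thread and waits at most timeoutMillis for the result.
    static <T> T callWithTimeout(Callable<T> externalCall, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(externalCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread to free it
            System.err.println("external call timed out after " + timeoutMillis + " ms");
            return fallback;
        } catch (ExecutionException e) {
            System.err.println("external call failed: " + e.getCause());
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
            future.cancel(true);
            return fallback;
        }
    }
}
```

A write operation can be wrapped the same way, for example by returning the number of affected rows from the call and checking it afterwards. In a real service the System.err lines would go to the logging and monitoring system.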

Monday, February 7, 2022

Fail Fast vs Fail Early

If you look at Stack Overflow or similar sites, you see Fail Fast and Fail Early used as synonyms. But they are not the same. Let me explain.

Fail Early

Fail Early works on the method level. The idea of Fail Early is to verify the parameters at method entry. The result is an early return or an early exception. The downside of Fail Early is that you have multiple exit points. That's why you have to keep your method short. But you always have to keep methods short anyway, otherwise you handle different concerns in them.

In the end, Fail Early is a technique to verify method parameters at the start of the method to avoid issues caused by bad data. The Lombok annotation @NonNull is a simple kind of Fail Early.
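
A minimal sketch of Fail Early in plain Java, without Lombok. The class PriceCalculator and the method applyDiscount are made up for this example. All parameter checks sit at the top of the method, so bad data leads to an early exception before any work is done.

```java
import java.util.Objects;

// Hypothetical example class, only meant to illustrate the parameter checks.
final class PriceCalculator {

    static long applyDiscount(String customerId, long priceCents, int percent) {
        // Fail early: verify all parameters at method entry.
        Objects.requireNonNull(customerId, "customerId must not be null");
        if (priceCents < 0) {
            throw new IllegalArgumentException("priceCents must not be negative: " + priceCents);
        }
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100: " + percent);
        }

        // From here on the method can rely on valid data.
        return priceCents - priceCents * percent / 100;
    }
}
```

With valid input, applyDiscount("c1", 1000, 10) returns 900. A null customerId or a percent value of 150 ends the method with an exception in the first lines.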

Fail Fast

Fail Fast works on the system level. It's a technique to improve the resilience of micro services. It's a server-side implementation of a circuit breaker. I developed this idea in 2014 for a German company, but I think many other software engineers have had the same idea. The Fail Fast pattern can be implemented in different ways and for different aspects of the software system. The following two examples show different aspects of it:

Fail Fast Example 1: Health Check

The micro service monitors its own health. If the service is unhealthy, maybe because of an unhealthy or unresponsive subsystem, the service can decide to fail fast on the related (HTTP) requests without putting the whole service into the unhealthy state.
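
One possible way to sketch this in Java is shown below. The class HealthGate, its methods and the choice of HTTP status 503 are assumptions made for this example. A health monitor reports the state of each subsystem, and only requests that need an unhealthy subsystem are rejected immediately.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example class: fails fast only for requests that depend on an unhealthy subsystem.
final class HealthGate {

    // Last known health state per subsystem, written by some health monitor.
    private final Map<String, Boolean> subsystemHealthy = new ConcurrentHashMap<>();

    void report(String subsystem, boolean healthy) {
        subsystemHealthy.put(subsystem, healthy);
    }

    // Returns an HTTP status code; 503 is an assumption made for this sketch.
    int handle(String requiredSubsystem, Runnable handler) {
        // A subsystem without a report is treated as healthy.
        if (!subsystemHealthy.getOrDefault(requiredSubsystem, true)) {
            return 503; // fail fast: do not even try to call the unhealthy subsystem
        }
        handler.run();
        return 200;
    }
}
```

Requests that depend on other subsystems still run normally, so the service as a whole stays healthy.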

Fail Fast Example 2: Performance Check

The micro service monitors its own request runtime. If the runtime exceeds a timeout, the service can cap the request execution to free resources and respond with an (HTTP) error code. This improves the performance resilience of the software system.
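
A simple variant of this check is a time budget that the request verifies between its processing steps. The sketch below assumes that a request can be split into steps. The names RequestBudget and BudgetExceededException as well as the HTTP status 503 are chosen for this example only. The clock is passed in so that the behavior can be tested without waiting.

```java
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical example class: a time budget that a request checks between its steps.
final class RequestBudget {

    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(long elapsedMillis) {
            super("request already ran " + elapsedMillis + " ms");
        }
    }

    private final LongSupplier clockMillis; // e.g. System::currentTimeMillis
    private final long startMillis;
    private final long budgetMillis;

    RequestBudget(long budgetMillis, LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        this.budgetMillis = budgetMillis;
        this.startMillis = clockMillis.getAsLong();
    }

    // Throws as soon as the request has used more time than its budget.
    void check() {
        long elapsed = clockMillis.getAsLong() - startMillis;
        if (elapsed > budgetMillis) {
            throw new BudgetExceededException(elapsed);
        }
    }

    // Runs the steps of a request and caps the execution once the budget is used up.
    static int handle(List<Runnable> steps, RequestBudget budget) {
        try {
            for (Runnable step : steps) {
                budget.check();
                step.run();
            }
            return 200;
        } catch (BudgetExceededException e) {
            return 503; // assumption: the error code used to signal the capped request
        }
    }
}
```

This variant only caps the request between steps. A single step that hangs needs a timeout of its own, for example with a future.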

If you understand this pattern, you can imagine that there are many more cases for Fail Fast.

Overall, the Fail Fast pattern prevents slow or weak subsystems from shutting down all your services. It's a kind of circuit breaker pattern.

Summary

The difference between Fail Fast and Fail Early is the abstraction level. Fail Fast protects your micro service. Fail Early makes your methods more stable. In the end you should use both.

Tuesday, November 3, 2020

Free Book about Chaos Engineering from Verica

Get a free book about Chaos Engineering written by Casey Rosenthal and Nora Jones. This is Casey's second book about Chaos Engineering. His name is associated with the well-known Chaos Monkey. Follow the link below to get the book for free.

https://www.verica.io/book/

Sunday, September 13, 2020

How to Start with Chaos Engineering

First, you should be clear about what Chaos Engineering is for.

Second, create the technical base for Chaos Engineering.

Third, run Chaos Engineering tests and analyze the results.

Fourth, go back to the first step until you have reached your goal.

Let's start with the first point: what is Chaos Engineering for? Chaos Engineering is a software test method that improves the reliability and the resilience, or robustness, of software. If resilience and reliability are part of your product (ask your Product Owner, PO), then at some point in the software lifecycle you will need to improve these two product features, and that is when you need Chaos Engineering.

Chaos Engineering is the new top of the software test pyramid. With unit tests you guarantee that a method or function works as expected at some points. With component tests you test that two modules work together at some points. With integration tests you test your component or service in the context of the whole software landscape; the test system is often called PRELIVE or NONPROD. But again you only test some points. With all these tests you try to guarantee the absence of functional errors. Most of these tests are positive path tests or edge case tests. Regardless of how high your test coverage is, there are unknown problems in your software, and with this kind of test you will never find them. Or, as Edsger Wybe Dijkstra wrote:

Program testing can be a very effective way to show the presence of bugs, but is hopelessly inadequate for showing their absence.

And this is exactly the point where Chaos Engineering comes into play. Chaos Engineering is able to find unknown problems or, better, unknown unknown problems.

Donald Rumsfeld: There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.

Chaos Engineering is a software testing method to address unknown unknown errors. And from a more practical point of view, Chaos Engineering is the way to improve reliability and resilience (robustness). In the upcoming articles I will explain the other points of Chaos Engineering.