Building for resiliency

How to avoid this type of failure?

1 — Avoid unbounded systems and APIs, at all costs.
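One common place unboundedness creeps in is a list endpoint that returns "everything". A minimal sketch of enforcing an upper bound, with the helper name and limits being illustrative choices, not part of any specific API:

```python
# Hypothetical page-size guard: never let a caller request an unbounded read.
MAX_PAGE_SIZE = 100
DEFAULT_PAGE_SIZE = 25

def bounded_page_size(requested=None):
    """Return a safe page size: a default when absent or invalid,
    and a hard-clamped value when the caller asks for too much."""
    if requested is None or requested <= 0:
        return DEFAULT_PAGE_SIZE  # sensible default instead of "return everything"
    return min(requested, MAX_PAGE_SIZE)
```

The point is that the bound lives server-side: no client input can turn a single request into an unbounded amount of work.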

2 — Identify the upper bounds of systems early and scale out in capacity units (e.g. clusters, shards, …). Each capacity unit should have all the systems required to operate without depending on the liveness of its sibling capacity units (e.g. a Jira and Confluence shard).
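A minimal sketch of the routing side of this idea, assuming tenants are pinned to self-contained capacity units via a stable hash (the shard names and tenant IDs here are hypothetical):

```python
import hashlib

# Each shard is assumed to be fully self-contained, so losing "shard-2"
# leaves tenants routed to "shard-0" and "shard-1" unaffected.
SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(tenant_id):
    """Stable mapping: the same tenant always lands on the same shard."""
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the mapping is deterministic, a shard outage is bounded to the tenants hashed onto it rather than the whole customer base.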

We spent a lot of time dissecting this particular incident — especially diving into how we could have prevented it in the first place. The team proposed a lot of great ideas, but one in particular stuck out. One of the engineers on the team proposed building a backup synchronisation mechanism in case the primary fails. Having spent a good part of my career building resilient systems, this hit a bit of a sore spot. On the surface, backup systems sound like a great idea, but in practice these systems are seldom exercised (yes, there are exceptions such as Netflix, who are amazing at this …). In the case of data centre failover, operations teams have to weigh the potential for data loss, since most systems are not replicated in real time. Most IT and business teams are not OK with data loss. In my experience, teams have opted to wait for their dependencies to come back online before pulling the trigger to fail over to a secondary data centre, region, or some other system. We decided not to build a backup synchronisation mechanism here. Instead, we applied a few different patterns to make the service more resilient.

How to avoid this type of failure?

3 — Do less work when your system is impaired, not more. If the system is impaired, do not rely on highly complex and infrequently exercised code to repair the system.

4 — Fail gracefully. When the system is not able to operate at full capacity, throttle requests. While operators (human or otherwise) repair the system, have customers back off.
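One simple way to throttle when capacity is reduced is a token bucket: admit a request only when a token is available, and otherwise tell the client to back off (e.g. with HTTP 429). This is a generic sketch, not the mechanism the team actually shipped:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: requests consume tokens, tokens
    refill at a fixed rate, and callers denied a token should back off."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # signal the client to back off (e.g. respond with 429)
```

Lowering `rate_per_sec` during an incident is one way operators can shed load while they repair the system.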

6 — Optimise for operating in a degraded state instead of failing hard. For example, consider serving stale data from caches for read operations during a database failure, instead of rejecting both read and write operations.
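A minimal sketch of the stale-read fallback, where `db_read` and the in-memory cache are hypothetical stand-ins for a real data store and cache layer:

```python
# Cache of the last successfully read value per key (illustrative only).
cache = {}

def read(key, db_read):
    """Read through to the database; on failure, fall back to stale cache."""
    try:
        value = db_read(key)
        cache[key] = value  # refresh the cache on every successful read
        return value, "fresh"
    except Exception:
        if key in cache:
            return cache[key], "stale"  # degraded, but still serving reads
        raise  # nothing cached for this key: surface the failure
```

Returning a freshness flag lets callers decide whether stale data is acceptable for their use case.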

7 — Handle downstream failures gracefully. If your system is unable to communicate with downstream systems, fail gracefully. Exponential back-off is a fantastic pattern to utilise here. Instead of trying harder to process requests (point 3), slow down and do less work when your downstream systems are telling you to slow down. Consider fast-fail mechanisms using decay algorithms to track the liveness of your downstream systems.
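Both ideas above can be sketched in a few lines: full-jitter exponential back-off for retries, and an exponentially decayed failure score as a crude fast-fail gate. The decay factor and threshold are illustrative, not tuned values from the incident:

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential back-off: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class HealthTracker:
    """Fast-fail gate: an exponentially decayed failure score approximates
    the liveness of a downstream system from recent call outcomes."""

    def __init__(self, decay=0.9, threshold=3.0):
        self.score = 0.0
        self.decay = decay
        self.threshold = threshold

    def record(self, success):
        # Old failures fade away; each new failure adds 1 to the score.
        self.score = self.score * self.decay + (0.0 if success else 1.0)

    def should_call(self):
        # Fast-fail (skip the call) once recent failures pile up.
        return self.score < self.threshold
```

A caller would check `should_call()` before each request, sleep `backoff_delay(attempt)` between retries, and feed every outcome back via `record()`.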

The last thing we thought about is blast-radius reduction. We know that complex systems will eventually fail — this is a no-brainer. The best that service owners can do when these systems fail is to prevent a full system outage. In our case, all customers experienced some form of outage during the nine-hour incident. Using the patterns above, we were able to limit the blast radius of future incidents — especially when downstream systems failed.
