Building for resiliency
Building for resiliency is the key to scaling your systems and infrastructure. Complex systems fail; that’s the inevitable truth.
It was around 6:45 PM when the oncall engineer started investigating an anomaly with a service that supported tens of millions of requests per second. The oncall engineer noticed that at 5 PM, and then again at 6 PM, the service had responded with 500s (internal server error) for five to ten minutes before auto-recovering. The availability graph showed a sharp dip at the top of each hour, followed by a full recovery.
This was obviously not expected behaviour. About an hour into the incident (7:45 PM by then), the primary oncall engineer decided to page in the secondary oncall engineer. After spending some more time together without making progress, they decided to page in the oncall manager. This was when I joined the call and spent the next six hours investigating why a service that had been running without issues for a few years was suddenly behaving this way.
Rewinding to earlier that day (around 2 PM), a different group had a big product launch. The new product took a dependency on the service I was managing for some core functionality; nothing out of the ordinary. The product was a big hit, so usage ramped up quickly on the new product and, in turn, on the service that was now experiencing issues. By that point in the life of the service, it had been processing many millions of transactions per second, so the team did not pay much attention to another team ramping up their usage. 'Auto Scaling' was enabled on the fleet, so as usage grew the team slept comfortably, trusting that automation would kick in and scale out the fleet to handle the traffic.
What I didn’t realise at the time was that I was in the midst of learning a painful lesson in running and scaling distributed systems: they do not scale linearly as load increases. In this case, as usage ramped up, the automated infrastructure scaling systems added more and more hosts to the fleet. This worked until an hourly cache synchronisation mechanism, which synchronised fleet-level metadata, could no longer keep up with the number of hosts it had to cover. The fleet had grown to the point where data about it could no longer be collected fast enough, and after the cache was flushed (as designed), the synchroniser could not repopulate it quickly enough for a fleet of that size.
We fixed this problem permanently by splitting the fleet into discrete units of capacity (i.e. multiple clusters of 100 hosts each). This pattern helped us identify the upper bounds of the system early (or set artificial ones) and enabled us to scale out horizontally without running into the limits of horizontally scaling one very large, complex system.
How do you avoid this type of failure?
1 — Avoid unbounded systems and APIs, at all costs.
2 — Identify the upper bounds of systems early and scale out in units of capacity (e.g. clusters, shards, …). Each capacity unit should have everything it needs to operate without depending on the liveness of its sibling capacity units (e.g. a Jira or Confluence shard).
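The capacity-unit idea above can be sketched in a few lines of Python. This is a minimal illustration, not our actual implementation: the names (`CapacityUnit`, `UnitRouter`) and the 100-host bound are assumptions for the example. Each unit enforces its own upper bound, and a router pins each tenant to exactly one unit, so scaling out means adding units rather than growing any single one.

```python
import hashlib

MAX_HOSTS_PER_UNIT = 100  # the upper bound identified (or set artificially) per unit


class CapacityUnit:
    """A self-contained cluster with a hard cap on its host count."""

    def __init__(self, name, hosts):
        # Refuse to grow past the known-safe size of one unit.
        assert len(hosts) <= MAX_HOSTS_PER_UNIT, "capacity unit exceeds its upper bound"
        self.name = name
        self.hosts = hosts


class UnitRouter:
    """Routes each tenant to exactly one capacity unit.

    Scale out by adding whole units, never by growing a unit
    past the bound at which its internals stop keeping up.
    """

    def __init__(self, units):
        self.units = units

    def unit_for(self, tenant_id: str) -> CapacityUnit:
        # Stable hash so a given tenant always lands on the same unit.
        digest = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
        return self.units[digest % len(self.units)]
```

One design note: simple modulo routing reshuffles tenants when units are added; a consistent-hashing ring or an explicit tenant-to-unit mapping table avoids that at the cost of a little more machinery.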
We spent a lot of time dissecting this particular incident, especially diving into how we could have prevented it in the first place. The team proposed a lot of great ideas, but one in particular stuck out: one of the engineers proposed building a backup synchronisation mechanism in case the primary failed. Having spent a good part of my career building resilient systems, this hit a bit of a sore spot. On the surface, backup systems sound like a great idea, but in practice these systems are seldom exercised (yes, there are exceptions, such as Netflix, who are amazing at this …). In the case of data centre failover, operations teams have to weigh the potential for data loss when a failover is required, given that most systems are not replicated in real time. Most IT and business teams are not OK with data loss. In my experience, teams have opted to wait for their dependencies to come back online rather than pull the trigger on failing over to a secondary data centre, region, or other system. We decided not to build a backup synchronisation mechanism here. Instead, we applied a few different patterns to make the service more resilient.
How do you avoid this type of failure?
3 — Do less work when your system is impaired, not more. Do not rely on highly complex, infrequently exercised code paths to repair an already-impaired system.
4 — Fail gracefully. When the system is not able to operate at full capacity, throttle requests. While operators (human or otherwise) repair the system, have customers back off.
5 — Optimise for operating in a degraded state instead of failing hard. For example, consider serving stale data from caches for read operations during a database failure instead of rejecting both read and write operations.
6 — Handle downstream failures gracefully. If your system is unable to communicate with downstream systems, fail gracefully. Exponential backoff is a fantastic pattern here: instead of trying harder to process requests (point 3), slow down and do less work when your downstream systems are telling you to slow down. Consider fail-fast mechanisms that use decay algorithms to track the liveness of your downstream systems.
The last thing we thought about was blast radius reduction. We know that complex systems will eventually fail; this is a no-brainer. The best that service owners can do when these systems fail is to prevent a full system outage. In our case, all customers experienced some form of outage during the nine-hour incident. Using the patterns above, we were able to limit the blast radius of future incidents, especially when downstream systems failed.
It’s hard to test how your system will operate and fail at scale. Applying countermeasures to common scaling challenges will help ensure your service can operate in a degraded mode when sub-systems fail systemically. Since this outage, the service has scaled gracefully, by scaling horizontally, to handle many more millions of requests per second, largely due to the resiliency patterns that we applied.