Back in 2015 at AWS, I took over a product that was riddled with technical debt. We managed between 70–100 high severity event on a good week. The system did not have basic protocols like rate limiting on the APIs, prioritisation of work, or defensive capabilities such as isolating noisy neighbours but it easily processed many millions of requests per second. In other words, it was a massive cluster f&^k under the hood but it worked. When we did our first assessment of the system, we honestly didn’t know where to start. The challenge ahead of us seemed unsurmountable. We knew a rewrite of the system was necessary but we’d have to earn the right to rewrite the system. One of the core engineering principles that we had at AWS was -
[…] Engineers are grateful to our predecessors. We appreciate the value of working systems and the lessons they embody. We understand that many problems are not essentially new.
Simply put, this means respect what came before. At the time I don’t think we really appreciated how this principle would shape how we would think as engineers and leaders as we progressed in our career.
It was tempting to rewrite the system at the time. In hind sight, I am really glad we didn’t. Later in my career, I learned that this temptation had a name -
The second-system effect (also known as second-system syndrome) is the tendency of small, elegant, and successful systems to be succeeded by over-engineered, bloated systems, due to inflated expectations and overconfidence.
While the system was in a less than desirable state at the time, it didn’t need a rewrite; it just needed some love ❤.
Transitional architecture is really a pretty way of saying iterative development. This means, take small steps forward. Going back to this system — we decided to take an iterative approach to fixing the mess that we inherited; putting aside our desire to rewrite it. We had a big hill to climb, but we had to take the first step. First, we looked at all the issues we had and created a chart breaking out the problems. The pareto looked something like this …
This picture gave us an interesting view (numbers are made up). We knew that there were two big problems with the system once we looked at the root causes visualised in the graph above. We needed to add rate limiting to the system and then figure out how to scale Service X so it didn’t crash under load. We broke the small team into squads that focused on fixing these two specific issues. We accepted the fact that we had other parts of the system that were also a total mess but chose to accept the problems — knowing they were solving problems for our customers but creating operational over head for us. A few sprints went by (which felt like an eternity), and we finally implemented the first iteration of rate limiting. Hurray …. right? WRONG!
Our first iteration of the rate limiting system was a static file that was built into the package that was deployed to production. This means we would have to update the file, check it in for code review, build it, then deploy it to production — in the middle of an incident! This sounds nasty but it worked! We didn’t reduce the number of incidents but the mean time to resolution (MTTR) of our incidents dropped in half. This was a huge win.
Next we needed to figure out how we could apply configuration changes to production without having to build the code and then deploy it. We replaced the static config file with a reference to a file that sat in a S3 bucket. We then built a separate tool that could push changes from the developers desktop to this file (with the appropriate checks and balances) and we had updated the service to reload this file every few minutes. On the surface this sounds easy, but we had to think through this more deeply. We had to make sure the service would work even if S3 failed. We had to figure out if the service would fail open or apply a default limit across all customers, handle eventually consistency across the fleet, and so on. These were hard problems to solve but we took our time answering them — knowing we had iteration 1 in production already. This bought us a ton of time.
Eventually we answered all the hard questions and got iteration 2 out to production. Our appetitive grew at this point. We reduced MTTR even more but engineers wished we had a dynamic system that would rate limit incoming requests automatically. We felt we NEEDED this! We HAD to have this system to reduce our operational load even more!
To build this system, we would need -
- A low latency data distribution protocol to report request-rate-by-customer from hosts across the fleet
- A semi-decentralised service that can make some decision based on heuristics
- A automated system to update request limits
- Oh, and this system had to be highly resilient and fault tolerant
After looking at the cost and complexity of this system, we decided to not build it. We collectively agreed that the system we built in iteration 2 was good enough — it gave us what we needed, worked well, and most importantly it was reliable.
The perfect system
I reflect on my time on this team from time to time and specifically the rate limiting project. We knew we needed to get to iteration 3 at some point. If we had started there, we would have built a highly complex system that would check all the boxes, cost us a lot of time, and return only some of the value of the original problem we were trying to solve while creating new problems along the way.
We took a similarly iterative approach to scaling the various services, implementing retries between services, and improving performance to the point where the code base only had similarities to what we had originally inherited. It’s unlikely that we would have been able to rewrite the system while managing the operational overhead. We would have ended up with second-system syndrome, a long project with low team morale, and new types of technical challenges. Using the principle of respecting what came before encouraged us to appreciate the value of working systems and the lessons they embody. This is one of the most powerful lessons I’ve learned in my career and one I hope to pass on to anyone that is open to learning.
The take aways
My time on this team taught me a lot of valuable lessons. Two of the most important ones are -
- Avoid the temptation to rewrite — On the surface, rewriting systems feel like the right thing to do to avoid the inherent complexities. When you appreciate the value of working systems and the lessons they embody rewrites are often less appealing. (Sometimes you have to rewrite a system. That’s ok. When you have to rewrite it, stick to replicating what works — stick with what works and make the smallest amount of change necessary)
- Transitional architecture — North stars are good to have but often they should be left as just that — a vision. An iterative or transitional architecture can help eliminate unnecessary complexity and deliver a solution that is good enough.
These are lessons that come from experience and ones I hope others won’t have to learn the hard way.