Structuring a Post Incident Review
Post Incident Reviews (PIRs) are a great way to mechanise “diving deep” in any organisation.
Why do we write PIRs?
Organisations like AWS and Atlassian often publish Post Incident Reviews -
- https://aws.amazon.com/message/41926/
- https://www.atlassian.com/engineering/post-incident-review-april-2022-outage
These organisations leverage an incident review process to -
- Avoid multi-hour or multi-day outages.
- Learn about scaling cliffs and share learnings.
- And, most importantly, make permanent improvements.
Outcomes
The outcomes of a really great PIR process are often -
- Systematic Improvements: changes to processes and procedures.
- Prevention: prevent recurrence of issues from the same root cause.
Blameless Culture
A lot of teams and companies talk about being blameless and often prioritise a blameless culture ahead of focusing on the real problem. This is a more common problem than you may think. When you experience repeated outages as a result of the same root cause, this is often a flag that teams are not leveraging the PIR process effectively.
Great organisations put systems in place to build blameless cultures; they do not just talk about it 💪.
Things a good PIR should answer
Getting to the nuts and bolts now. If executed well, a great PIR should answer most of the questions below -
- How can we prevent a recurrence?
- What parts of the “system” allowed this failure to happen?
- Are there broken processes?
- Missing processes?
- Missing checkpoints?
- and many more …
Structuring a PIR
Structuring and writing a PIR is as much science as it is art. I’ve found that really great PIRs leverage the structure below -
Incident Summary
This section provides a high level overview of the incident answering -
- What happened?
- How did it happen?
- Who was impacted?
- How long did it last?
- Why did it happen?
Timeline
This section lays out the events in chronological order to explain what transpired. This is helpful in improving future processes.
For example, if you see that an engineer engaged at 10 AM but did not take any action until 11 AM, you can dive into what caused that gap. Is the team missing runbooks or standard operating procedures? Alternatively, the timeline can be used to highlight key learnings/issues. For example, the operator was not able to log into the host due to incorrect credentials.
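The gap hunting described above can be sketched in a few lines of Python. This is only an illustration: the `TIMELINE` entries, the date, the "Service recovered" event and the 30 minute threshold are all made up, and `find_gaps` is a hypothetical helper rather than part of any tool.

```python
from datetime import datetime, timedelta

# Hypothetical timeline entries: (timestamp, description).
# The 10:00 and 11:00 entries mirror the example in the text; everything else is invented.
TIMELINE = [
    ("2024-01-01 10:00", "Engineer engaged"),
    ("2024-01-01 11:00", "First mitigation action taken"),
    ("2024-01-01 11:10", "Service recovered"),
]


def find_gaps(timeline, threshold):
    """Return (earlier_event, later_event, gap) for consecutive events
    that are separated by more than `threshold`."""
    parsed = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M"), what) for ts, what in timeline
    )
    gaps = []
    # Walk through neighbouring pairs of events in chronological order.
    for (t0, e0), (t1, e1) in zip(parsed, parsed[1:]):
        if t1 - t0 > threshold:
            gaps.append((e0, e1, t1 - t0))
    return gaps


for before, after, gap in find_gaps(TIMELINE, timedelta(minutes=30)):
    print(f"{gap} between '{before}' and '{after}': what caused this gap?")
```

Each gap it reports is a prompt for a question in the review, not a conclusion in itself.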
Whys
This is the most critical part of the PIR. This section dives into the root cause.
Let’s focus on the WHY.
For children, “why” questions help them make sense of the world around them that they are just beginning to learn about (e.g. why is the sky blue?).
It’s not much different from getting thrown into an operational event. Things are new and you are starting to make sense of the world around you. Asking why is a great way to make sense of what’s going on.
Intro to five whys
A very common technique to get to a root cause is the 5 whys technique: ask “why” about the problem, then ask “why” again about each answer, until you reach a cause you can act on.
Example
This is a very simple root cause investigation, but ultimately we want to get to a point where we do not have a recurrence of this event because of the same root cause.
In this case, we want to spend a lot of time trying to figure out how an engineer was able to push a bad configuration into production. It is really important to ensure that teams are not focusing on the fact that someone pushed a bad change, but rather on how the system allowed a bad change to be pushed to production. This is one way to build a blameless culture.
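One way to keep a why chain honest is to write it down as data, so that each “why” is asked about the previous answer. The sketch below assumes a hypothetical `WhyChain` helper, and every answer in it is invented for illustration around the bad configuration scenario; it is not taken from a real incident. Note that the answers describe the system, not the engineer.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """A problem statement plus the ordered answers to repeated "why?" questions."""
    problem: str
    answers: list = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        # Record the answer to "why?" asked about the previous answer
        # (or about the problem itself, for the first question).
        self.answers.append(answer)

    def root_cause(self) -> str:
        # The deepest answer so far is the current candidate root cause.
        return self.answers[-1] if self.answers else self.problem

    def render(self) -> str:
        lines = [f"Problem: {self.problem}"]
        subject = self.problem
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"Why #{depth}: why '{subject}'? -> {answer}")
            subject = answer
        return "\n".join(lines)


# Hypothetical chain; all answers are illustrative assumptions.
chain = WhyChain("A bad configuration reached production")
chain.ask_why("The change was deployed without being validated")
chain.ask_why("The deployment pipeline has no configuration validation step")
chain.ask_why("Configuration changes bypass the checks applied to code changes")
print(chain.render())
print("Candidate root cause:", chain.root_cause())
```

The chain stops when the last answer is something the team can change, which may take fewer or more than five questions.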
Counter Measures
The 5 Whys uses “counter-measures,” rather than solutions. A counter-measure is an action or set of actions that seeks to prevent the problem arising again, while a solution may just seek to deal with the symptom. As such, counter-measures are more robust, and will more likely prevent the problem from recurring.
— Sakichi Toyoda
Corrective Action
The last section should cover the corrective actions. Corrective actions should work through (1) immediate actions to prevent a recurrence, (2) systematic actions to prevent a recurrence, and finally (3) monitoring to ensure the problem remains fixed.
Prevents a recurrence of the event
- Think about systematic changes here, e.g. alarms and better telemetry.
Removes reliance on “humans”
- Relying on humans is an anti-pattern. Think about how to automate the action.
Good intentions vs. mechanisms
- A common pattern in PIRs: “Implement better tests.” What does that mean exactly? How do you measure that?
- How about adding an alarm when coverage drops below XX%?
- Document test cases as part of PR.
- Feature flags.
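To make the difference between a good intention and a mechanism concrete, here is a minimal sketch of the coverage alarm idea. The function names are hypothetical and the threshold is deliberately left as a parameter, since the right value is for each team to decide. It assumes your test tooling can already report a coverage percentage.

```python
def coverage_alarm(current_percent: float, threshold_percent: float) -> bool:
    """Return True when the alarm should fire, i.e. coverage is below the threshold."""
    for value in (current_percent, threshold_percent):
        if not 0 <= value <= 100:
            raise ValueError("percentages must be between 0 and 100")
    return current_percent < threshold_percent


def gate(current_percent: float, threshold_percent: float):
    """Return (exit_code, message) so a CI job can fail the build when the alarm fires."""
    if coverage_alarm(current_percent, threshold_percent):
        return 1, (
            f"ALARM: coverage {current_percent:.1f}% is below "
            f"the {threshold_percent:.1f}% threshold"
        )
    return 0, (
        f"OK: coverage {current_percent:.1f}% meets "
        f"the {threshold_percent:.1f}% threshold"
    )
```

A build step could call `gate` with the reported coverage and the agreed threshold, print the message, and exit with the returned code. The point is that the check runs on every change without anyone having to remember it.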
[Link to an example PIR coming shortly …]