Structuring a Post Incident Review

Post Incident Reviews (PIRs) are a great way to mechanise “diving deep” in any organisation.

Why do we write PIRs?

Organisations use an incident review process to:
- Avoid multi-hour or multi-day outages.
- Learn about scaling cliffs and share learnings.
- And most importantly make permanent improvements.

Outcomes

Blameless Culture

Great organisations put systems in place to build blameless cultures rather than just talking about them 💪.

Things a good PIR should answer

  • How can we prevent a recurrence?
  • What parts of the “system” allowed this failure to happen?
  • Are there broken processes?
  • Missing processes?
  • Missing checkpoints?
  • and many more …

Structuring a PIR

Incident Summary

  • What happened?
  • How did it happen?
  • Who was impacted?
  • How long did it last?
  • Why did it happen?

Timeline

The timeline records who did what and when, which makes gaps in the response visible. For example, if you see that an engineer engaged at 10 AM but did not take any action until 11 AM, you can dive into what caused that gap. Is the team missing runbooks or standard operating procedures? The timeline can also be used to highlight key learnings and issues, for example that the operator was not able to log into the host due to incorrect credentials.
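As a rough illustration of the gap analysis described above, the sketch below scans a timeline for long pauses between consecutive entries. The timestamps, event descriptions, the `find_gaps` helper, and the 30-minute threshold are all hypothetical and chosen only to mirror the 10 AM to 11 AM example.

```python
from datetime import datetime, timedelta

# Hypothetical timeline entries: (timestamp, description).
TIMELINE = [
    (datetime(2024, 1, 1, 9, 45), "Alarm fires"),
    (datetime(2024, 1, 1, 10, 0), "Engineer engages"),
    (datetime(2024, 1, 1, 11, 0), "Engineer takes first action"),
    (datetime(2024, 1, 1, 11, 20), "Service recovers"),
]


def find_gaps(timeline, threshold=timedelta(minutes=30)):
    """Return (previous_event, next_event, gap) for every gap longer than threshold."""
    # Sort by timestamp so out-of-order entries do not hide a gap.
    ordered = sorted(timeline, key=lambda entry: entry[0])
    gaps = []
    for (prev_time, prev_desc), (next_time, next_desc) in zip(ordered, ordered[1:]):
        gap = next_time - prev_time
        if gap > threshold:
            gaps.append((prev_desc, next_desc, gap))
    return gaps


if __name__ == "__main__":
    for prev_desc, next_desc, gap in find_gaps(TIMELINE):
        print(f"{gap} between '{prev_desc}' and '{next_desc}'")
```

Each flagged gap is a prompt for a question in the review, not a verdict on the person involved.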

Whys

This is the most critical part of the PIR. This section dives into the root cause.

Let’s focus on the WHY.

For children, “why” questions help make sense of a world they are just beginning to learn about (e.g. why is the sky blue?).

Being thrown into an operational event is not much different. Things are new and you are trying to make sense of the world around you. Asking why is a great way to understand what’s going on.

Intro to five whys
A very common technique to get to a root cause is using the 5 whys technique.

Example

This is a very simple root cause investigation, but ultimately we want to get to a point where we do not have a recurrence of this event because of the same root cause.

In this case, we want to spend a lot of time trying to figure out how an engineer was able to push a bad configuration into production. It is really important that teams do not focus on the fact that someone pushed a bad change, but rather on how the system allowed a bad change to be pushed to production. This is one way to build a blameless culture.
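One way a system can stop a bad configuration from reaching production is a validation step that runs before every deployment. The sketch below is only an illustration of that idea: the keys, the allowed ranges, and the `validate_config` and `deploy` helpers are hypothetical, and `push` stands in for whatever actually ships the change.

```python
def validate_config(config):
    """Return a list of problems; an empty list means the config may be deployed."""
    problems = []

    # Hypothetical rule: these keys must always be present.
    for key in ("service_name", "timeout_seconds", "max_connections"):
        if key not in config:
            problems.append(f"missing required key: {key}")

    # Hypothetical rule: the timeout must be a positive number of at most 60 seconds.
    timeout = config.get("timeout_seconds")
    if timeout is not None:
        if not isinstance(timeout, (int, float)) or not 0 < timeout <= 60:
            problems.append("timeout_seconds must be a number in (0, 60]")

    return problems


def deploy(config, push):
    """Call push(config) only if validation finds no problems."""
    problems = validate_config(config)
    if problems:
        # The mechanism, not the engineer, blocks the bad change.
        raise ValueError("config rejected: " + "; ".join(problems))
    push(config)
```

With a gate like this, the question in the review becomes "which rule was missing?" rather than "who pushed the change?".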

Counter Measures

The 5 Whys uses “counter-measures,” rather than solutions. A counter-measure is an action or set of actions that seeks to prevent the problem arising again, while a solution may just seek to deal with the symptom. As such, counter-measures are more robust, and will more likely prevent the problem from recurring.

— Sakichi Toyoda

Corrective Action

Prevents a recurrence of the event
Think about systemic changes here, e.g. alarms and better telemetry.

Removes reliance on “humans”
Relying on a person to remember a manual step is an anti-pattern. Think about how to automate the action.

Good intentions vs. mechanisms
A common action item in PIRs is “Implement better tests”. What does that mean exactly? How do you measure it? Concrete mechanisms work better:
- Add an alarm when coverage drops below XX%.
- Document test cases as part of the PR.
- Feature flags.
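To make the difference between an intention and a mechanism concrete, here is a minimal sketch of the coverage alarm idea. The `coverage_alarm` function is hypothetical, and the threshold is deliberately left as a parameter because the right value is a decision for each team.

```python
def coverage_alarm(coverage_percent, threshold_percent):
    """Return an alarm message if coverage is below the threshold, else None."""
    if not 0 <= coverage_percent <= 100:
        raise ValueError("coverage_percent must be between 0 and 100")
    if coverage_percent < threshold_percent:
        return (
            f"ALARM: coverage {coverage_percent:.1f}% "
            f"is below threshold {threshold_percent:.1f}%"
        )
    return None
```

In a build pipeline, this could be called with the percentage reported by the coverage tool, failing the build or paging the team whenever a message is returned.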

[Link to an example PIR coming shortly …]
