Defining Incident Severity
I often get asked how to define an incident severity matrix. Several dimensions can be used to define the severity of an incident. The most popular are:
- Customer impact: How many customers are impacted by the incident?
> SEV1 when no customer can access their pages.
- Duration: How long has the incident been ongoing?
> SEV2 if the incident has been ongoing for 4 hours.
- Impact to business: What’s the potential impact to the business (e.g. revenue impact, churn, …)?
> SEV1 if the incident may result in >$1M in bookings impact (e.g. kayak.com)
In my experience, using customer impact as the primary dimension for defining incident severity yields the best results. Incidents ultimately impact customers, and bringing services back to good health should be the primary focus for any team (next to security) when our services are impaired.
The numbers in the impact column should be derived from a percentage of hourly active users (HAU) and should be revisited whenever HAU changes materially.
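To make the HAU derivation concrete, here is a minimal sketch of turning percentage thresholds into absolute customer counts and classifying an incident against them. The percentages, the `SEV4` fallback, and the function names are all hypothetical and chosen only for illustration; substitute the values from your own matrix.

```python
# Hypothetical thresholds: minimum share of hourly active users (HAU) impacted
# for each severity, ordered from most to least severe.
# These percentages are illustrative assumptions, not recommended values.
SEVERITY_THRESHOLDS_PCT = [
    ("SEV1", 0.10),   # 10% of HAU
    ("SEV2", 0.01),   # 1% of HAU
    ("SEV3", 0.001),  # 0.1% of HAU
]


def impact_thresholds(hau: int) -> dict[str, int]:
    """Convert the percentage thresholds into absolute customer counts."""
    return {sev: max(1, round(hau * pct)) for sev, pct in SEVERITY_THRESHOLDS_PCT}


def classify(impacted_customers: int, hau: int) -> str:
    """Return the most severe level whose threshold is met, else SEV4."""
    # Dicts preserve insertion order, so SEV1 is checked first.
    for sev, minimum in impact_thresholds(hau).items():
        if impacted_customers >= minimum:
            return sev
    return "SEV4"


if __name__ == "__main__":
    hau = 100_000  # assumed current hourly active users
    print(impact_thresholds(hau))
    print(classify(2_500, hau))
```

Keeping the thresholds as percentages means that when HAU moves materially, only the `hau` input changes and the absolute numbers in the matrix can be regenerated rather than renegotiated.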
🤔 Does the "Tier" of the service matter?
I don't think it should matter. If you are a non-critical service, you are unlikely to be serving up requests that impact more than 1,000 customers at any given time.