Ideal Quick Glance Dashboard

Jul 6, 2022

What are the foundational graphs that I look for in a quick glance dashboard? This is the dashboard that EVERY engineering manager and on-call engineer should be able to pull up when an alert fires to get a 50,000 ft view of their service.

Service Metrics

Key APIs / pages usage metrics

For each of your key APIs (e.g. create and delete, but not list)

Reliability
- number of 200s you are sending vs 500s
Usage volume (N) at a 5-min granularity
Error codes (500, 403, …)
Latency
- P50: the median, which is roughly the typical experience (it is not the same as the average)
- P90: the experience the slowest 10% of your customers are getting
- P99: the experience the slowest 1% of your customers are getting
Key domain specific metrics
- queue depth
- dwell time
- email errors
- payment failures
Secondary APIs / pages usage metrics (e.g. list)
- Same as above but separated out from the key metrics.
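
To make the per-API numbers above concrete, here is a minimal Python sketch that rolls raw request records into 5-minute buckets and computes volume, status-code counts, 5xx error rate, and P50/P90/P99 latency. The `Request` record, the `summarize` function, and the nearest-rank percentile method are assumptions made for illustration; in practice your metrics system most likely does this aggregation for you.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

BUCKET_SECONDS = 300  # 5-minute granularity


@dataclass
class Request:
    # Hypothetical record shape; adapt to whatever your access logs contain.
    timestamp: float   # seconds since epoch
    status: int        # HTTP status code
    latency_ms: float


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests):
    """Group requests into 5-minute buckets and compute the dashboard numbers."""
    buckets = defaultdict(list)
    for r in requests:
        bucket_start = int(r.timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
        buckets[bucket_start].append(r)

    summary = {}
    for start, rs in sorted(buckets.items()):
        latencies = sorted(r.latency_ms for r in rs)
        status_counts = defaultdict(int)
        for r in rs:
            status_counts[r.status] += 1
        server_errors = sum(n for code, n in status_counts.items() if code >= 500)
        summary[start] = {
            "volume": len(rs),
            "status_counts": dict(status_counts),
            "error_rate": server_errors / len(rs),  # share of 5xx responses
            "p50": percentile(latencies, 50),
            "p90": percentile(latencies, 90),
            "p99": percentile(latencies, 99),
        }
    return summary
```

Running `summarize` separately for the key APIs and for the secondary ones keeps a noisy list endpoint from hiding a problem on create or delete.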

Dependencies

For each of your dependencies (e.g. Auth, network, …)

Usage volume (N) at a 5-min granularity
Error codes (500, 403, …)
Latency
- P50: the median, which is roughly the typical experience (it is not the same as the average)
- P90: the experience the slowest 10% of your customers are getting
- P99: the experience the slowest 1% of your customers are getting

You will want two sets of graphs. The first set comes from the metrics that your service emits about its calls to each dependency. The second set comes from what your dependencies emit themselves. Having both is helpful when there are last-mile issues, such as network-level packet loss, where the request never reaches the downstream service but you still see errors because of request timeouts. In this case, the downstream service will look healthy from the service owner's perspective even though customers are severely impacted.
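
The comparison between the two sets of graphs can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `WindowStats` shape, the `classify` function, its label strings, and the 5% error threshold are not from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    # Hypothetical per-window counters for one dependency.
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def classify(caller_view: WindowStats, dependency_view: WindowStats,
             error_threshold: float = 0.05) -> str:
    """Compare what the caller measured with what the dependency reported.

    error_threshold is an assumed cut-off; pick one that matches your service.
    """
    caller_bad = caller_view.error_rate >= error_threshold
    dependency_bad = dependency_view.error_rate >= error_threshold
    if dependency_bad:
        # The dependency itself reports errors, so it is a real downstream problem.
        return "dependency_degraded"
    if caller_bad:
        # The caller sees errors (e.g. timeouts) that the dependency never saw.
        return "last_mile_suspected"
    return "healthy"
```

For example, `classify(WindowStats(1000, 300), WindowStats(700, 0))` returns `"last_mile_suspected"`: the caller saw 30% errors while the dependency reported none for the traffic that actually arrived.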

Infra (Capacity Planning)

These are your secondary metrics.

Host

Host-level metrics I look for (an average across the fleet is usually OK):

Various network i/o
- i/o out
- retransmit → indicates packet loss
Disk
- usage
- used
Memory
- usage
- used
JVM / VM usage metrics
- heap usage → identifies GC pressure / mem leak
- ..

Network

General overview of network level graphs
- network bandwidth / used
AZ ↔ AZ canaries
- health check latency
- health check success
Load balancers
- Usage
- Error rates
Certs
- expiry date
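
For the cert expiry graph, a rough Python sketch of the underlying number is below. It assumes the expiry timestamp is in the `notAfter` string format that Python's `ssl.SSLSocket.getpeercert()` returns, which `ssl.cert_time_to_seconds` can parse. The function names and the 30-day and 7-day thresholds are illustrative assumptions, not recommendations from any standard.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires.

    not_after uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    now is seconds since epoch; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def expiry_status(not_after, now=None, warn_days=30, page_days=7):
    """Map remaining days to a status; both thresholds are assumed values."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= page_days:
        return "page"
    if remaining <= warn_days:
        return "warn"
    return "ok"
```

Plotting `days_until_expiry` as a single number per cert makes an upcoming expiry visible weeks before it turns into an outage.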

Database / Persistence Layer

Storage
I/O volume
Errors
Health checks

Hygiene

This graph is great!

This graph is better!

Alerts

The obvious one — when your metrics breach the desired threshold ALERT.
- Don’t get too hung up on what’s the RIGHT or WRONG level to alert at. Ask yourself — At what level is the customer experience no longer acceptable? We are all users of our platform.
Alert when
- there are missing metrics
- there is a sharp increase from 0% to 100% error rate in one data point
  - this may be noisy during transient errors, but it will save you during a bad deployment
- a heuristic trips, for example
  - a sharp increase from 0% to 25%+ error rate in two data points
  - a sharp increase in latency in two data points
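
The alert conditions above can be sketched as a small Python function. This is one possible reading, not a definitive rule set: it assumes each series is a list of recent data points (oldest first, `None` for a missing point), treats "in two data points" as "the last two points are both elevated after a clean baseline", and uses an assumed 2x factor for a "sharp" latency increase. The function and alert names are hypothetical.

```python
from typing import List, Optional, Sequence


def _last(series: Sequence[Optional[float]], n: int):
    """Return the last n points, or None if fewer exist or any is missing."""
    tail = list(series[-n:])
    if len(tail) < n or any(v is None for v in tail):
        return None
    return tail


def evaluate_alerts(error_rates: Sequence[Optional[float]],
                    latencies_ms: Sequence[Optional[float]],
                    spike_rate: float = 0.25,
                    latency_factor: float = 2.0) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts = []

    # Missing metrics: the newest data point of either series is absent.
    if (not error_rates or error_rates[-1] is None
            or not latencies_ms or latencies_ms[-1] is None):
        alerts.append("missing_metrics")

    # 0% -> 100% error rate in a single data point.
    pair = _last(error_rates, 2)
    if pair and pair[0] == 0.0 and pair[1] >= 1.0:
        alerts.append("error_rate_0_to_100")

    # 0% baseline followed by two data points at 25%+ (assumed interpretation).
    triple = _last(error_rates, 3)
    if triple and triple[0] == 0.0 and min(triple[1:]) >= spike_rate:
        alerts.append("error_rate_sustained_spike")

    # Latency at least latency_factor times the baseline for two data points.
    lat = _last(latencies_ms, 3)
    if lat and lat[0] > 0 and min(lat[1:]) >= latency_factor * lat[0]:
        alerts.append("latency_sustained_spike")

    return alerts
```

For instance, `evaluate_alerts([0.0, 0.3, 0.4], [100, 100, 100])` returns `["error_rate_sustained_spike"]`, while a series that ends in `None` returns `["missing_metrics"]`.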