What are some of the foundational graphs that I look for in a quick-glance dashboard? This is the dashboard that EVERY engineering manager and on-call engineer should be able to pull up when there is an alert to get a 50,000 ft view of their service.
Service Metrics
Key APIs / pages usage metrics
For each of your key APIs (e.g. create, delete, not list)
- Reliability
  - number of 200s you are sending vs. 500s
- Usage volume (N) at a 5-min granularity
- Error codes (500, 403, …)
- Latency
  - P50: the median experience; half of your requests are faster than this
  - P90: the experience the slowest 10% of your customers are getting (this latency or worse)
  - P99: the experience the slowest 1% of your customers are getting (this latency or worse)
- Key domain specific metrics
  - queue depth
  - dwell time
  - email errors
  - payment failures
- Secondary APIs / pages usage metrics (e.g. list)
  - Same as above but separated out from the key metrics.
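As a rough illustration of the per-API graphs listed above, here is a minimal sketch that groups raw request records into 5-minute buckets and derives volume, status-code counts, and P50/P90/P99 latency for each bucket. The record shape `(timestamp_seconds, status_code, latency_ms)`, the function names, and the nearest-rank percentile method are all assumptions made for this example; your metrics system will most likely compute these for you.

```python
import math
from collections import Counter, defaultdict

BUCKET_SECONDS = 300  # 5-minute granularity


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty, already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests):
    """Summarize (timestamp_seconds, status_code, latency_ms) records per bucket.

    The record shape is hypothetical; adapt it to whatever your service logs.
    """
    buckets = defaultdict(list)
    for ts, status, latency_ms in requests:
        bucket_start = int(ts // BUCKET_SECONDS) * BUCKET_SECONDS
        buckets[bucket_start].append((status, latency_ms))

    summary = {}
    for bucket_start, rows in sorted(buckets.items()):
        latencies = sorted(latency for _, latency in rows)
        summary[bucket_start] = {
            "volume": len(rows),
            "status_counts": dict(Counter(status for status, _ in rows)),
            "p50_ms": percentile(latencies, 50),
            "p90_ms": percentile(latencies, 90),
            "p99_ms": percentile(latencies, 99),
        }
    return summary
```

Keeping the status counts per code, rather than a single success/failure number, lets the same data drive both the reliability graph (200s vs. 500s) and the error-code breakdown.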
Dependencies
For each of your dependencies (e.g. Auth, network, …)
- Usage volume (N) at a 5-min granularity
- Error codes (500, 403, …)
- Latency
  - P50: the median experience; half of your calls are faster than this
  - P90: the experience the slowest 10% of your customers are getting (this latency or worse)
  - P99: the experience the slowest 1% of your customers are getting (this latency or worse)
You will want two sets of graphs. The first set comes from the metrics that your own service emits. The second set comes from what your dependencies emit. This is helpful when there are last-mile issues, such as network-level packet loss, where requests never reach the downstream service but you still see some type of error because of request timeouts. In this case, the downstream service will look healthy from its owner's perspective even though customers are severely impacted.
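To make the value of the two views concrete, here is a small sketch that compares the failure rate your service observes when calling a dependency with the error rate the dependency itself reports for the same window. The dictionary keys, the function names, and the 5-percentage-point `gap_threshold` are assumptions chosen for illustration, not recommended values.

```python
def failure_rate(failures, total):
    """Fraction of failed calls; 0.0 when there was no traffic."""
    return failures / total if total else 0.0


def last_mile_suspected(caller_view, dependency_view, gap_threshold=0.05):
    """Return True when the caller sees far more failures than the dependency reports.

    caller_view:      {"calls": int, "failures": int}   (failures include timeouts)
    dependency_view:  {"requests": int, "errors": int}
    Both shapes and the threshold are hypothetical.
    """
    caller_rate = failure_rate(caller_view["failures"], caller_view["calls"])
    dependency_rate = failure_rate(dependency_view["errors"], dependency_view["requests"])
    return caller_rate - dependency_rate > gap_threshold


# The caller timed out on 200 of 1000 calls, but the dependency only saw
# 800 requests and served all of them without error.
print(last_mile_suspected({"calls": 1000, "failures": 200},
                          {"requests": 800, "errors": 0}))
```

A large gap between the two rates does not prove a network problem, but it is a strong hint to look between the two services rather than inside either one.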
Infra (Capacity Planning)
These are your secondary metrics.
Host
Host level metrics I look for (avg across the fleet is usually ok):
- Various network i/o
  - i/o out
  - retransmit → indicates packet loss
- Disk
  - usage
  - used
- Memory
  - usage
  - used
- JVM / VM usage metrics
  - heap usage → identifies GC pressure / mem leak
  - ..
Network
- General overview of network level graphs
  - network bandwidth / used
- AZ ↔ AZ canaries
  - health check latency
  - health check success
- Load balancers
  - Usage
  - Error rates
- Certs
  - expiry date
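For the certificate expiry graph, one option is to plot the number of days remaining rather than the raw date. The sketch below assumes you already have the certificate's `notAfter` string in the format returned by Python's `ssl.SSLSocket.getpeercert()`; the helper name `days_until_expiry` is hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate expires.

    not_after: a string such as "Jan  5 09:34:43 2018 GMT", the format used
               by the "notAfter" field of ssl.SSLSocket.getpeercert().
    now:       timezone-aware datetime; defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400


remaining = days_until_expiry(
    "Jan  5 09:34:43 2018 GMT",
    now=datetime(2018, 1, 1, tzinfo=timezone.utc),
)
print(f"{remaining:.1f} days until expiry")
```

A days-remaining number is easy to put a threshold on, so the same value can feed both the dashboard and an alert.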
Database / Persistence Layer
- Storage
- I/O volume
- Errors
- Health checks
Hygiene
This graph is great!
This graph is better!
Alerts
- The obvious one: ALERT when your metrics breach the desired threshold.
  - Don’t get too hung up on what’s the RIGHT or WRONG level to alert at. Ask yourself: at what level is the customer experience no longer acceptable? We are all users of our platform.
- Alert when
  - metrics are missing
  - the error rate jumps sharply from 0% to 100% in one data point
    - may be noisy during transient errors, but will save you during a bad deployment
  - heuristics trip, for example:
    - the error rate jumps sharply from 0% to 25%+ in two data points
    - latency increases sharply in two data points
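The alert conditions above can be sketched as a small rule evaluator. This is one possible reading, not a definitive implementation: "in two data points" is interpreted as the last two points both being elevated relative to the point before them, and the input shape, rule names, and `latency_jump_factor` of 2.0 are all assumptions for illustration.

```python
def evaluate_alerts(error_rates, latencies_ms, latency_jump_factor=2.0):
    """Return the names of the alert rules that fire.

    error_rates:  per-data-point error rates (0.0 to 1.0), oldest first;
                  None marks a data point that was never reported.
    latencies_ms: per-data-point latency (e.g. P99), oldest first.
    """
    # Missing metrics: a gap in the newest data point is itself a signal.
    if not error_rates or error_rates[-1] is None:
        return ["missing_metrics"]

    alerts = []

    # 0% -> 100% error rate in a single data point (e.g. a bad deployment).
    if len(error_rates) >= 2 and error_rates[-2] == 0.0 and error_rates[-1] == 1.0:
        alerts.append("error_rate_0_to_100")

    # 0% -> 25%+ error rate, sustained over the last two data points.
    recent = error_rates[-3:]
    if len(recent) == 3 and None not in recent:
        baseline, first, second = recent
        if baseline == 0.0 and first >= 0.25 and second >= 0.25:
            alerts.append("error_rate_0_to_25_plus")

    # Sharp latency increase, sustained over the last two data points.
    recent_latency = latencies_ms[-3:]
    if len(recent_latency) == 3 and None not in recent_latency and recent_latency[0] > 0:
        limit = latency_jump_factor * recent_latency[0]
        if all(value >= limit for value in recent_latency[1:]):
            alerts.append("latency_spike")

    return alerts


print(evaluate_alerts([0.0, 0.3, 0.4], [100, 110, 120]))
```

Requiring two elevated data points for the heuristic rules trades a little detection speed for fewer pages on transient blips, while the single-point 0% to 100% rule stays fast on purpose.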