Ideal Quick Glance Dashboard

What foundational graphs do I look for in a quick-glance dashboard? This is the dashboard that EVERY engineering manager and on-call engineer should be able to pull up when an alert fires, to get a 50,000 ft view of their service.

Service Metrics

For each of your key APIs (e.g. create and delete, but not list):

  • Reliability
    - number of 200s you are sending vs 500s
  • Usage volume (N) at a 5-min granularity
  • Error codes (500, 403, …)
  • Latency
    - P50: the median, i.e. the typical experience (not the average, which outliers can skew)
    - P90: the experience of your slowest 10% of requests (this latency or worse)
    - P99: the experience of your slowest 1% of requests (this latency or worse)
  • Key domain specific metrics
    - queue depth
    - dwell time
    - email errors
    - payment failures
  • Secondary APIs / pages usage metrics (e.g. list)
    - Same as above but separated out from the key metrics.
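As a rough illustration of how the volume, error-rate, and latency numbers above could be derived for one API in one 5-minute bucket, here is a minimal Python sketch. The input shape (a list of `(status_code, latency_ms)` tuples) and the function names are assumptions made for this example; in practice your metrics system will usually compute these for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize one API in one time bucket.

    requests: list of (status_code, latency_ms) tuples (hypothetical input shape).
    """
    total = len(requests)
    if total == 0:
        # No traffic at all is worth surfacing on its own.
        return {"volume": 0}
    latencies = [latency for _, latency in requests]
    server_errors = sum(1 for status, _ in requests if 500 <= status <= 599)
    return {
        "volume": total,
        "error_rate": server_errors / total,
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p99": percentile(latencies, 99),
    }
```

Nearest-rank is only one of several percentile definitions; monitoring backends often interpolate or estimate from histograms, so their numbers can differ slightly from this sketch.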

Dependencies

  • Usage volume (N) at a 5-min granularity
  • Error codes (500, 403, …)
  • Latency
    - P50: the median, i.e. the typical experience (not the average, which outliers can skew)
    - P90: the experience of your slowest 10% of requests (this latency or worse)
    - P99: the experience of your slowest 1% of requests (this latency or worse)

You will want two sets of graphs. The first set comes from the metrics that your own service emits. The second set comes from what your dependencies emit. Having both is helpful for last-mile issues such as network-level packet loss, where requests never reach the downstream service but you still see errors because of request timeouts. In that case, the downstream service looks healthy from its owner’s perspective even though customers are severely impacted.
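One way to make that comparison concrete is to put the caller-side and dependency-side error rates for the same time bucket next to each other and flag a large gap. The sketch below is only illustrative: the function name and the `min_gap` tolerance are hypothetical, and a real threshold would need tuning for your traffic.

```python
def compare_views(caller_error_rate, dependency_error_rate, min_gap=0.05):
    """Compare what the caller observes with what the dependency reports.

    Both rates are fractions between 0 and 1 for the same time bucket.
    min_gap is a hypothetical tolerance, not a recommended value.
    """
    gap = caller_error_rate - dependency_error_rate
    # A large positive gap means the caller is failing in ways the
    # dependency never sees, e.g. timeouts caused by packet loss.
    return {"gap": gap, "last_mile_suspected": gap >= min_gap}
```

For example, a caller seeing 30% errors while the dependency reports 1% would be flagged, whereas matching rates on both sides would not.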

Infra (Capacity Planning)

Host

  • Network I/O
    - I/O out
    - retransmit → indicates packet loss
  • Disk
    - usage
    - used
  • Memory
    - usage
    - used
  • JVM / VM usage metrics
    - heap usage → identifies GC pressure / mem leak
    - ..

Network

  • AZ ↔ AZ canaries
    - health check latency
    - health check success
  • Load balancers
    - Usage
    - Error rates
  • Certs
    - expiry date
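For the cert expiry graph, the value to plot is usually "days until expiry". A minimal sketch using only the Python standard library is shown below. It assumes you already have the certificate's `notAfter` string in the format that `ssl.SSLSocket.getpeercert()` returns; the function name and the `now` parameter are hypothetical and exist only to keep the example testable.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the number of days until a certificate expires (negative if already expired).

    not_after: a string such as "Jan  5 09:34:43 2018 GMT".
    now: seconds since the epoch; defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY
```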

Database / Persistence Layer

  • I/O volume
  • Errors
  • Health checks

Hygiene

This graph is better!

This graph is better because it clearly highlights what to look for. When the error rate breaches the threshold, an alert should fire.

Alerts

  • Alert when
    - there are missing metrics
    - the error rate jumps sharply from 0% to 100% in one data point
      - may be noisy during transient errors but will save you during a bad deployment
    - heuristics fire, such as
      - a sharp increase from 0% to 25%+ error rate sustained over two data points
      - a sharp increase in latency sustained over two data points
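The rules above can be sketched as a small evaluation function over the most recent data points. This is an illustration under stated assumptions rather than a production alerting config: the function and rule names are hypothetical, "sharp increase in latency" is interpreted here as at least `latency_factor` times the earlier value (the factor of 2 is a guess), and a missing data point is represented as `None`.

```python
def evaluate_alerts(error_rates, latencies_ms, latency_factor=2.0):
    """Return the names of the rules that fire for the latest data point.

    Each list holds one value per data point, oldest first.
    Error rates are fractions between 0 and 1; None marks a missing point.
    """
    if len(error_rates) < 3 or len(latencies_ms) < 3:
        raise ValueError("need at least three data points per series")

    # Missing metrics: the service may be down or the pipeline may be broken.
    recent = list(error_rates[-3:]) + list(latencies_ms[-3:])
    if any(point is None for point in recent):
        return ["missing_metrics"]

    fired = []
    before, prev, last = error_rates[-3:]

    # 0% -> 100% in a single data point (typical of a bad deployment).
    if prev == 0 and last >= 1.0:
        fired.append("error_rate_0_to_100_in_one_point")

    # 0% -> 25%+ sustained over two data points.
    if before == 0 and prev >= 0.25 and last >= 0.25:
        fired.append("error_rate_25_plus_for_two_points")

    # Latency elevated for two consecutive data points relative to the point before.
    base, lat_prev, lat_last = latencies_ms[-3:]
    threshold = latency_factor * base
    if base > 0 and lat_prev >= threshold and lat_last >= threshold:
        fired.append("latency_up_for_two_points")

    return fired
```

Requiring two consecutive points for the softer rules trades a slightly slower alert for fewer pages on one-off blips, while the single-point 0% to 100% rule stays fast on purpose.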
