Ideal Quick Glance Dashboard

Zak Islam
3 min read · Jul 6, 2022

What are some of the foundational graphs that I look for in a quick glance dashboard? This is the dashboard that EVERY engineering manager and on-call engineer should be able to pull up when there is an alert, to get a 50,000 ft view of their service.

Service Metrics

Key APIs / pages usage metrics

For each of your key APIs (e.g. create and delete, but not list)

  • Reliability
    - number of 200s you are sending vs 500s
  • Usage volume (N) at a 5-min granularity
  • Error codes (500, 403, …)
  • Latency (a small percentile sketch follows this list)
    - P50: the median, i.e. the typical customer experience
    - P90: the experience your slowest 10% of customers are getting
    - P99: the experience your slowest 1% of customers are getting
  • Key domain-specific metrics
    - queue depth
    - dwell time
    - email errors
    - payment failures
  • Secondary APIs / pages usage metrics (e.g. list)
    - Same as above but separated out from the key metrics.
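As a rough sketch of how those percentile numbers come out of raw data (assuming you collect individual latency samples per 5-minute window; most real systems estimate percentiles from histograms instead), nearest-rank percentiles can be computed like this:

    # Minimal sketch: nearest-rank percentiles over one 5-minute window of
    # latency samples. `latencies_ms` is hypothetical sample data.
    import math

    def percentile(samples, pct):
        """Return the value at or below which pct% of samples fall."""
        ordered = sorted(samples)
        rank = math.ceil(pct / 100 * len(ordered))
        return ordered[max(rank - 1, 0)]

    latencies_ms = [12, 15, 14, 250, 18, 16, 900, 13, 17, 15]
    for pct in (50, 90, 99):
        print(f"P{pct}: {percentile(latencies_ms, pct)} ms")

The same definitions apply to the dependency latencies below.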

Dependencies

For each of your dependencies (e.g. Auth, network, …)

  • Usage volume (N) at a 5-min granularity
  • Error codes (500, 403, …)
  • Latency
    - P50: the median, i.e. the typical customer experience
    - P90: the experience your slowest 10% of customers are getting
    - P99: the experience your slowest 1% of customers are getting

You will want two sets of graphs. The first set comes from the metrics that your service emits; the second set comes from what your dependencies emit. This is helpful when there are last-mile issues, such as network-level packet loss, where the request isn't reaching the downstream service but you are seeing errors because of request time-outs. In this case, the downstream service will look healthy from the service owner's perspective even though customers are severely impacted.
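Here is a minimal sketch of that first (client-side) set of dependency metrics, assuming a hypothetical emit_metric helper standing in for your real metrics client (StatsD, CloudWatch, etc.). Wrapping the call means timeouts and packet loss still show up on your dashboard even when the request never reaches the downstream service:

    # Sketch: emit your own latency / status / timeout metrics around a
    # dependency call. `emit_metric` and the Auth URL are hypothetical.
    import time
    import requests

    def emit_metric(name, value):
        print(f"{name}={value}")  # stand-in for a real metrics client

    def call_auth_dependency(url):
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=2)
            emit_metric("dep.auth.status_code", resp.status_code)
            return resp
        except requests.exceptions.Timeout:
            # The downstream may look healthy on its own graphs; yours won't.
            emit_metric("dep.auth.timeout", 1)
            return None
        finally:
            emit_metric("dep.auth.latency_ms", (time.monotonic() - start) * 1000)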

Infra (Capacity Planning)

These are your secondary metrics.

Host

Host-level metrics I look for (an average across the fleet is usually OK); a sampling sketch follows the list:

  • Various network I/O
    - I/O out
    - retransmits → indicate packet loss
  • Disk
    - usage (%)
    - amount used
  • Memory
    - usage (%)
    - amount used
  • JVM / VM usage metrics
    - heap usage → identifies GC pressure / memory leaks
    - …
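A minimal sketch of sampling these host-level metrics with the psutil library (my assumption; any host agent works). A real agent would ship the values to your metrics backend rather than print them:

    # Sketch: sample host memory, disk, and network counters with psutil.
    import psutil

    mem = psutil.virtual_memory()
    disk = psutil.disk_usage("/")
    net = psutil.net_io_counters()

    print("memory.used_pct", mem.percent)
    print("memory.used_bytes", mem.used)
    print("disk.used_pct", disk.percent)
    print("disk.used_bytes", disk.used)
    print("net.bytes_sent", net.bytes_sent)
    print("net.packets_dropped_out", net.dropout)
    # psutil does not expose TCP retransmits directly; on Linux they can be
    # read from /proc/net/snmp (the Tcp RetransSegs counter).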

Network

  • General overview of network-level graphs
    - network bandwidth (capacity vs. used)
  • AZ ↔ AZ canaries (see the canary sketch after this list)
    - health check latency
    - health check success
  • Load balancers
    - Usage
    - Error rates
  • Certs
    - expiry date
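A minimal canary sketch covering the AZ ↔ AZ health-check latency/success graphs and the cert-expiry graph, run from one AZ against an endpoint in another (the endpoint name is hypothetical):

    # Sketch: health-check latency/success plus days until TLS cert expiry.
    import socket
    import ssl
    import time

    import requests

    def check_health(url):
        start = time.monotonic()
        try:
            ok = requests.get(url, timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        latency_ms = (time.monotonic() - start) * 1000
        print(f"canary.success={int(ok)} canary.latency_ms={latency_ms:.1f}")

    def days_until_cert_expiry(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
        return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

    check_health("https://service.az-b.example.com/health")
    print("cert.days_to_expiry", round(days_until_cert_expiry("service.az-b.example.com")))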

Database / Persistence Layer

  • Storage
  • I/O volume
  • Errors
  • Health checks

Hygiene

[Figure: an error-rate graph on its own. This graph is great!]

[Figure: the same graph with the alert threshold marked. This graph is better!]

The second graph is better because it clearly highlights what to look for: when the error rate breaches the threshold, we should have an alert.

Alerts

  • The obvious one — when your metrics breach the desired threshold ALERT.
    - Don’t get too hung up on what’s the RIGHT or WRONG level to alert at. Ask yourself — At what level is the customer experience no longer acceptable? We are all users of our platform.
  • Alert when
    - there are missing metrics
    - there is a sharp increase from 0% to 100% error rate in one data point (maybe noisy during transient errors, but it will save you during a bad deployment)
    - based on heuristics (see the sketch after this list), e.g.
      - sharp increase from 0% to 25%+ error rate in two data points
      - sharp increase in latency in two data points
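A minimal sketch of those heuristics, applied to a series of per-datapoint error rates (in percent, with None for a missing datapoint). The thresholds are the ones from the list above; tune them for your own service:

    # Sketch: flag missing metrics and sharp error-rate jumps.
    def check_error_rates(points):
        alerts = []
        for i, value in enumerate(points):
            if value is None:
                alerts.append(f"t={i}: missing metric")
                continue
            prev = points[i - 1] if i >= 1 else None
            prev2 = points[i - 2] if i >= 2 else None
            if prev == 0 and value >= 100:
                # Likely a bad deployment; may be noisy on transient blips.
                alerts.append(f"t={i}: 0% -> 100% in one data point")
            elif prev2 == 0 and value >= 25:
                alerts.append(f"t={i}: 0% -> {value}% in two data points")
        return alerts

    print(check_error_rates([0, 0, 0, 100, None, 0, 12, 30]))
    # A similar check over P99 latency catches the "sharp increase in latency" case.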
