My notes from reading the book Practical Monitoring: Effective Strategies for the Real World. I keep only what I find helpful and discard the rest. This is not a book summary.

Monitoring anti-patterns

  1. Tool obsession: use as few tools as you need
  2. Monitoring is everyone's job, not a single role
  3. Great monitoring is more than checking the box marked: "We have monitoring"
  4. Monitoring does not fix the broken things

Monitoring patterns

  1. Composable monitoring
  • use multiple specialized tools and couple them loosely together:
    • benefit:
      • flexible
      • no long-term commitment to a particular tool
    • break down a monitoring service:
      • data collection
        • pull model: requires a central system to keep track of all known clients, handle scheduling, and parse the returned data.
        • push model: clients send their data to a central location on their own schedule.
      • data storage:
        • Time series DB (TSDB)
        • Data rollup: summarize old data at a lower resolution (for example, keep per-hour averages instead of per-minute points).
      • visualization
      • analytics and reporting
      • alerting
        • monitoring doesn’t exist to generate alerts: alerts are just one possible outcome.

        Monitoring is for asking questions.
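
The difference between the pull and push collection models listed above can be sketched as follows. This is a minimal illustration, not how any particular tool works: `PullCollector`, `PushReceiver`, and the `key=value` payload format are hypothetical, and scraping a client is simulated with a plain callable instead of a network request.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Sample = Tuple[str, str, float]  # (client name, metric name, value)


class PullCollector:
    """Pull model: the central system knows every client and polls each one."""

    def __init__(self) -> None:
        # Registry of known clients: name -> function returning raw data.
        self.clients: Dict[str, Callable[[], str]] = {}
        self.store: List[Sample] = []

    def register(self, name: str, scrape: Callable[[], str]) -> None:
        self.clients[name] = scrape

    @staticmethod
    def parse(raw: str) -> Metrics:
        # Parse the "key=value" lines returned by a client.
        metrics: Metrics = {}
        for line in raw.splitlines():
            if not line.strip():
                continue
            key, _, value = line.partition("=")
            metrics[key.strip()] = float(value)
        return metrics

    def run_once(self) -> None:
        # One scheduling tick: poll every registered client and store the result.
        for name, scrape in self.clients.items():
            for key, value in self.parse(scrape()).items():
                self.store.append((name, key, value))


class PushReceiver:
    """Push model: clients send data themselves; no client registry is needed."""

    def __init__(self) -> None:
        self.store: List[Sample] = []

    def receive(self, client: str, metrics: Metrics) -> None:
        for key, value in metrics.items():
            self.store.append((client, key, value))
```

The pull side carries the registry, the scheduling loop, and the parser; the push side only has to accept whatever arrives.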

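Data rollup, mentioned under data storage above, can be illustrated with a small sketch. It assumes points are `(timestamp, value)` pairs and that averaging is an acceptable summary; the `rollup` function and the bucket size are hypothetical choices, and a real TSDB may use other aggregations (min, max, sum).

```python
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value)


def rollup(points: List[Point], bucket_seconds: int) -> List[Point]:
    """Average raw points into fixed-size time buckets (coarser resolution)."""
    buckets: Dict[int, List[float]] = {}
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Five one-minute points collapsed into two five-minute points.
raw = [(0, 1.0), (60, 3.0), (120, 5.0), (300, 10.0), (360, 20.0)]
rolled = rollup(raw, bucket_seconds=300)
```

The trade-off is storage against detail: the rolled-up series is smaller, but short spikes inside a bucket are averaged away.
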
  2. Start monitoring from the user's perspective
  • Start from
    • HTTP response code
    • Latency
  • Then
    • CPU
    • Memory ...
  3. Tie business metrics to technical metrics
  • User logins => User login failures, login latency
  • Comments submitted => Comment submission failures, submission latency
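
As a rough sketch of tying a business metric to its technical counterparts, the login example above could be derived from a single stream of events. The `LoginEvent` type, the `summarize_logins` function, and the metric names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LoginEvent:
    succeeded: bool
    latency_ms: float


def summarize_logins(events: List[LoginEvent]) -> Dict[str, float]:
    """Derive one business metric and two technical metrics from the same events."""
    total = len(events)
    failures = sum(1 for e in events if not e.succeeded)
    avg_latency = sum(e.latency_ms for e in events) / total if total else 0.0
    return {
        "user_logins": total - failures,      # business metric
        "login_failures": failures,           # technical metric
        "login_latency_ms_avg": avg_latency,  # technical metric
    }
```

When the business metric drops, the technical metrics computed from the same events point at whether failures or slowness is the likely cause.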

Alert

  1. Stop using email
  • a great way to overwhelm everyone with noise
  • noisy alerts cause people to stop trusting the monitoring system, which leads them to ignore it entirely
  2. Delete and Tune Alerts
  • fewer alerts, less noise

Tactics

Monitor the application
  1. Health check
  • DB, cache
  • dependent APIs
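
A health check that covers the dependencies listed above might look like the sketch below. It assumes each dependency check is a callable that raises on failure; `health_check` and the check names are hypothetical, and mapping an unhealthy result to HTTP 503 is a common convention rather than something these notes prescribe.

```python
from typing import Callable, Dict


def health_check(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run every dependency check; a check passes if it does not raise."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {
        "status": "ok" if healthy else "degraded",
        "http_status": 200 if healthy else 503,  # assumed convention
        "checks": results,
    }


def failing_cache_check() -> None:
    # Stand-in for a real cache ping that cannot reach the server.
    raise ConnectionError("cache unreachable")


report = health_check({
    "db": lambda: None,            # stand-in for a real "SELECT 1"
    "cache": failing_cache_check,
    "payments_api": lambda: None,  # stand-in for a dependent API call
})
```

Reporting each dependency separately makes it clear which one broke, instead of returning a single pass/fail flag.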
Monitor the server
  • CPU
  • Memory

When reading memory usage (for example, in the output of `free -m`), pay attention to the second row. It indicates that the system has 360MB free out of 488MB. Don't count buffers and cache as used memory: they store metadata, recently accessed areas ... and remain available to any process that needs memory.
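
The arithmetic behind that second row can be sketched as follows, assuming the older `free` layout in which a second row reports memory with buffers and cache excluded from usage. The function name and the per-column numbers are hypothetical; they are chosen only so that the result matches the 360MB out of 488MB noted above.

```python
def memory_available_mb(free_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """Memory processes can claim: free memory plus reclaimable buffers and cache."""
    return free_mb + buffers_mb + cached_mb


# Hypothetical breakdown for a 488MB machine (illustrative numbers only).
total_mb = 488
available_mb = memory_available_mb(free_mb=120, buffers_mb=60, cached_mb=180)
used_by_processes_mb = total_mb - available_mb
print(f"{available_mb}MB of {total_mb}MB available, {used_by_processes_mb}MB used by processes")
```

Looking only at the raw "free" column (120MB in this made-up breakdown) would make the machine appear far closer to running out of memory than it is.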

  • QPS: throughput
  • HTTP response code
  • Database server
    • number of connections: indicates traffic level
    • QPS: how busy it is
    • slow queries
    • replication delay
    • IOPS
  • Message queues
    • queue length: number of messages on a queue waiting to be taken off
    • consumption rate: messages per second
  • Cache
    • hit rate
    • number of evicted items
    • latency
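
Two of the metrics above reduce to simple ratios, sketched here under the assumption that raw counters (hits, misses, messages consumed) are already being collected. The function names are hypothetical.

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of cache lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def consumption_rate(messages_consumed: int, interval_seconds: float) -> float:
    """Messages taken off the queue per second over a measurement interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return messages_consumed / interval_seconds
```

For example, `cache_hit_rate(90, 10)` gives a 90% hit rate, and `consumption_rate(600, 60)` gives 10 messages per second. A consumption rate that stays flat while queue length keeps growing suggests consumers are falling behind.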