This curriculum spans the technical and organizational complexity of a multi-phase observability rollout, comparable to an internal capability build supported by advisory engagements across SRE, security, and platform teams.
Module 1: Designing Monitoring Strategy and Tool Selection
- Selecting between open-source (e.g., Prometheus, Grafana) and commercial (e.g., Datadog, New Relic) tools based on team size, compliance needs, and long-term TCO.
- Defining ownership boundaries between SRE, DevOps, and application teams for monitoring coverage and alert responsibility.
- Establishing criteria for monitoring coverage: deciding which services require full observability versus minimal metrics collection.
- Negotiating data retention policies with legal and security teams to balance compliance with storage costs.
- Integrating monitoring tools into existing CI/CD pipelines without introducing deployment bottlenecks.
- Standardizing naming conventions and labeling strategies across teams to ensure consistent metric tagging and querying.
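The naming-convention work above is easiest to enforce mechanically, e.g. as a lint step in CI. A minimal sketch, assuming a hypothetical `<team>_<service>_<metric>_<unit>` convention with an allowlisted set of unit suffixes (the pattern and suffixes are illustrative, not a standard):

```python
import re

# Hypothetical convention: snake_case segments ending in an approved unit
# suffix, e.g. payments_api_request_duration_seconds.
METRIC_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+_(seconds|bytes|total|ratio)$")

def validate_metric_name(name: str) -> bool:
    """Return True if the metric name follows the shared convention."""
    return METRIC_NAME.fullmatch(name) is not None

def lint_metrics(names):
    """Partition candidate metric names into accepted and rejected lists."""
    ok = [n for n in names if validate_metric_name(n)]
    bad = [n for n in names if not validate_metric_name(n)]
    return ok, bad
```

Running such a check at metric-registration time (rather than at query time) keeps non-conforming names out of the backend entirely.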
Module 2: Instrumenting Applications and Services
- Deciding when to use auto-instrumentation versus manual instrumentation based on language, framework, and performance requirements.
- Configuring custom metrics in microservices to capture business-relevant KPIs without overloading the monitoring backend.
- Implementing structured logging with consistent schema (e.g., JSON) across services to enable reliable parsing and correlation.
- Choosing between push (e.g., StatsD) and pull (e.g., Prometheus scraping) models based on network topology and scalability constraints.
- Managing version skew between instrumentation libraries and monitoring backends during phased rollouts.
- Securing telemetry data with mTLS for encryption in transit and API key rotation for agent authentication, particularly in multi-tenant environments.
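The structured-logging item above can be sketched with only the standard library: a formatter that emits each record as one JSON object with a fixed schema. The field names (`ts`, `level`, `service`, `msg`) and the `ctx` merging convention are illustrative choices, not an established standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object with a fixed schema."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Merge structured context passed via extra={"ctx": {...}}.
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"ctx": {"order_id": "o-123", "latency_ms": 42}})
```

Because every service emits the same top-level keys, a log pipeline can parse and correlate records without per-service parsing rules.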
Module 3: Infrastructure and Host-Level Monitoring
- Configuring agent-based monitoring (e.g., Telegraf, CloudWatch Agent) on VMs with minimal CPU/memory overhead.
- Monitoring ephemeral container workloads in Kubernetes using sidecar exporters and node-level collectors.
- Handling high-cardinality labels in metrics (e.g., pod names, request IDs) that can degrade time-series database performance.
- Setting up host-level health checks that trigger automated remediation or failover without causing alert storms.
- Correlating infrastructure metrics (CPU, disk I/O) with application latency to isolate performance bottlenecks.
- Managing agent deployment and configuration at scale using configuration management tools (e.g., Ansible, Puppet).
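One way to address the high-cardinality problem above is a pre-ingestion guard that drops any label whose distinct-value count exceeds a budget, so series keyed on per-request IDs never reach the time-series database. A minimal sketch; the budget value and the in-memory tracking approach are illustrative simplifications:

```python
from collections import defaultdict

class LabelGuard:
    """Drop labels whose observed cardinality exceeds a fixed budget."""

    def __init__(self, budget: int = 1000):
        self.budget = budget
        self.seen = defaultdict(set)  # label name -> distinct values observed

    def filter(self, labels: dict) -> dict:
        """Return a copy of `labels` with over-budget keys removed."""
        kept = {}
        for key, value in labels.items():
            self.seen[key].add(value)
            if len(self.seen[key]) <= self.budget:
                kept[key] = value
        return kept
```

A production version would need bounded memory (e.g. a sketch-based counter) and shared state across collector replicas, but the filtering decision itself stays this simple.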
Module 4: Distributed Tracing and Service Observability
- Implementing trace context propagation across service boundaries using W3C Trace Context or OpenTelemetry standards.
- Sampling high-volume traces to reduce storage costs while preserving visibility into error paths and slow transactions.
- Integrating tracing into legacy systems that lack native context propagation support.
- Mapping trace data to service ownership for accurate SLI/SLO calculations and incident accountability.
- Filtering sensitive data (e.g., PII, tokens) from traces before export to comply with privacy regulations.
- Aligning span naming conventions across teams to enable consistent service map generation and dependency analysis.
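The W3C Trace Context propagation above centers on the `traceparent` header (`version-traceid-spanid-flags`). A sketch of parsing an incoming header and minting a child header that preserves the trace ID, assuming version `00` only:

```python
import re
import secrets

# traceparent: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, flags), or None if malformed."""
    m = TRACEPARENT.fullmatch(header)
    return m.groups() if m else None

def child_traceparent(header: str) -> str:
    """Carry the trace ID forward into a new child span's header."""
    parsed = parse_traceparent(header)
    if parsed is None:
        # No valid incoming context: start a new trace.
        trace_id = secrets.token_hex(16)
        flags = "01"
    else:
        trace_id, _, flags = parsed
    new_span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{new_span_id}-{flags}"
```

In practice an OpenTelemetry SDK handles this automatically; writing it by hand is mainly useful when retrofitting the legacy systems mentioned above.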
Module 5: Alerting and Incident Response
- Writing alerting rules that minimize false positives by incorporating duration, thresholds, and historical baselines.
- Routing alerts to on-call engineers using escalation policies in tools like PagerDuty or Opsgenie based on service criticality.
- Designing alert fatigue mitigation strategies, including alert grouping, throttling, and auto-resolution.
- Integrating runbook links and diagnostic commands directly into alert notifications for faster triage.
- Conducting blameless postmortems to update alerting rules based on incident root causes.
- Testing alert delivery and on-call workflows through scheduled fire drills and synthetic failures.
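The false-positive reduction described above (combining a threshold with a duration) can be sketched as an alert that fires only after N consecutive breaching samples, suppressing transient spikes. Threshold and window values are illustrative:

```python
class DurationAlert:
    """Fire only when the metric exceeds the threshold for
    `for_samples` consecutive observations."""

    def __init__(self, threshold: float, for_samples: int):
        self.threshold = threshold
        self.for_samples = for_samples
        self.breaches = 0  # current run of consecutive breaches

    def observe(self, value: float) -> bool:
        if value > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # any healthy sample resets the run
        return self.breaches >= self.for_samples
```

This mirrors the `for:` clause in Prometheus alerting rules: the condition must hold continuously for a window before the alert transitions to firing.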
Module 6: Monitoring in CI/CD and Pre-Production Environments
- Enabling monitoring in ephemeral environments (e.g., PR branches, staging) without incurring unnecessary costs.
- Using synthetic transactions to validate health and performance before promoting to production.
- Comparing performance metrics across environments to detect configuration drift or resource constraints.
- Blocking CI/CD pipelines based on SLO violations or critical alert triggers during canary deployments.
- Archiving monitoring data from short-lived environments for audit and debugging purposes.
- Securing access to pre-production monitoring dashboards to prevent exposure of test data.
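The canary-gating item above reduces to a pass/fail decision on observed error rate against the SLO, with a minimum sample size so a lucky early window cannot pass. All thresholds here are illustrative:

```python
def canary_gate(requests: int, errors: int,
                slo_error_rate: float = 0.001,
                min_requests: int = 500) -> bool:
    """Return True if the canary may be promoted.

    Requires a minimum traffic volume before judging, then compares the
    observed error rate against the SLO's allowed error rate.
    """
    if requests < min_requests:
        return False  # not enough traffic yet to make a call
    return (errors / requests) <= slo_error_rate
```

A CI/CD pipeline would poll the monitoring backend for `requests` and `errors` over the canary window and block promotion whenever this returns False.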
Module 7: Capacity Planning and Performance Benchmarking
- Using historical metric trends to forecast infrastructure scaling needs and justify budget requests.
- Establishing baseline performance profiles for services to detect degradation after deployments.
- Correlating monitoring data with cost allocation tools to identify underutilized or overprovisioned resources.
- Running load tests in production-like environments and comparing results against monitoring data.
- Setting up automated scaling policies based on real-time metrics (e.g., CPU, request queue depth).
- Documenting capacity thresholds and scaling procedures in shared runbooks for operational consistency.
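The trend-based forecasting above can be sketched as a least-squares line fitted to evenly spaced usage samples and extrapolated forward. A naive linear trend is only a starting point; real capacity models also need seasonality and growth-curve assumptions:

```python
def forecast_usage(samples, horizon):
    """Fit a least-squares line to evenly spaced usage samples and
    extrapolate `horizon` steps past the last observation."""
    n = len(samples)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + horizon)
```

Comparing the forecast against documented capacity thresholds gives a concrete, defensible number for the budget requests mentioned above.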
Module 8: Governance, Compliance, and Monitoring Maturity
- Defining and auditing monitoring coverage standards across business units to ensure regulatory compliance.
- Implementing role-based access control (RBAC) in monitoring platforms to restrict sensitive data exposure.
- Conducting quarterly reviews of alerting rules and dashboards to remove obsolete or redundant configurations.
- Measuring monitoring maturity using frameworks like the Observability Maturity Model (OMM).
- Integrating monitoring data into internal audit workflows for SOC 2, ISO 27001, or HIPAA compliance.
- Establishing cross-team feedback loops to refine monitoring practices based on incident reviews and user feedback.
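The RBAC item above can be reduced to a small policy check: roles map to the dashboard tags a user may view, and access requires some role that grants every tag on the dashboard. Role and tag names are purely illustrative:

```python
# Hypothetical role -> viewable dashboard tags mapping.
ROLE_GRANTS = {
    "viewer": {"public"},
    "sre": {"public", "infra", "alerts"},
    "security": {"public", "audit"},
}

def can_view(roles, dashboard_tags) -> bool:
    """True if any of the user's roles grants every tag on the dashboard."""
    required = set(dashboard_tags)
    return any(required <= ROLE_GRANTS.get(role, set()) for role in roles)
```

Centralizing the mapping like this also makes it auditable, which matters for the SOC 2 and ISO 27001 workflows listed above.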