This curriculum spans the technical and organizational complexity of a multi-phase observability rollout, comparable to an internal capability build supported by advisory engagements across SRE, security, and platform teams.
Module 1: Designing Monitoring Strategy and Tool Selection
- Selecting between open-source (e.g., Prometheus, Grafana) and commercial (e.g., Datadog, New Relic) tools based on team size, compliance needs, and long-term TCO.
- Defining ownership boundaries between SRE, DevOps, and application teams for monitoring coverage and alert responsibility.
- Establishing criteria for monitoring coverage: deciding which services require full observability versus minimal metrics collection.
- Negotiating data retention policies with legal and security teams to balance compliance with storage costs.
- Integrating monitoring tools into existing CI/CD pipelines without introducing deployment bottlenecks.
- Standardizing naming conventions and labeling strategies across teams to ensure consistent metric tagging and querying.
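The naming-convention work above is easiest to enforce mechanically, e.g. as a lint step in CI. A minimal sketch, assuming a hypothetical `<team>_<service>_<metric>_<unit>` convention with an allowlisted set of unit suffixes (the pattern and suffixes are illustrative, not a standard):

```python
import re

# Hypothetical convention: snake_case segments ending in an approved unit
# suffix, e.g. payments_api_request_duration_seconds.
METRIC_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+_(seconds|bytes|total|ratio)$")

def validate_metric_name(name: str) -> bool:
    """Return True if the metric name follows the shared convention."""
    return METRIC_NAME.fullmatch(name) is not None

def lint_metrics(names):
    """Partition candidate metric names into accepted and rejected lists."""
    ok = [n for n in names if validate_metric_name(n)]
    bad = [n for n in names if not validate_metric_name(n)]
    return ok, bad
```

Running such a check at metric-registration time (rather than at query time) keeps non-conforming names out of the backend entirely.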
Module 2: Instrumenting Applications and Services
- Deciding when to use auto-instrumentation versus manual instrumentation based on language, framework, and performance requirements.
- Configuring custom metrics in microservices to capture business-relevant KPIs without overloading the monitoring backend.
- Implementing structured logging with consistent schema (e.g., JSON) across services to enable reliable parsing and correlation.
- Choosing between push (e.g., StatsD) and pull (e.g., Prometheus scraping) models based on network topology and scalability constraints.
- Managing version skew between instrumentation libraries and monitoring backends during phased rollouts.
- Securing telemetry data with mTLS for encryption in transit and API key rotation for agent authentication, particularly in multi-tenant environments.
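The structured-logging item above can be sketched with only the standard library: a formatter that emits each record as one JSON object with a fixed schema. The field names (`ts`, `level`, `service`, `msg`) and the `ctx` merging convention are illustrative choices, not an established standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object with a fixed schema."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Merge structured context passed via extra={"ctx": {...}}.
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"ctx": {"order_id": "o-123", "latency_ms": 42}})
```

Because every service emits the same top-level keys, a log pipeline can parse and correlate records without per-service parsing rules.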
Module 3: Infrastructure and Host-Level Monitoring
- Configuring agent-based monitoring (e.g., Telegraf, CloudWatch Agent) on VMs with minimal CPU/memory overhead.
- Monitoring ephemeral container workloads in Kubernetes using sidecar exporters and node-level collectors.
- Handling high-cardinality labels in metrics (e.g., pod names, request IDs) that can degrade time-series database performance.
- Setting up host-level health checks that trigger automated remediation or failover without causing alert storms.
- Correlating infrastructure metrics (CPU, disk I/O) with application latency to isolate performance bottlenecks.
- Managing agent deployment and configuration at scale using configuration management tools (e.g., Ansible, Puppet).
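One way to address the high-cardinality problem above is a pre-ingestion guard that drops any label whose distinct-value count exceeds a budget, so series keyed on per-request IDs never reach the time-series database. A minimal sketch; the budget value and the in-memory tracking approach are illustrative simplifications:

```python
from collections import defaultdict

class LabelGuard:
    """Drop labels whose observed cardinality exceeds a fixed budget."""

    def __init__(self, budget: int = 1000):
        self.budget = budget
        self.seen = defaultdict(set)  # label name -> distinct values observed

    def filter(self, labels: dict) -> dict:
        """Return a copy of `labels` with over-budget keys removed."""
        kept = {}
        for key, value in labels.items():
            self.seen[key].add(value)
            if len(self.seen[key]) <= self.budget:
                kept[key] = value
        return kept
```

A production version would need bounded memory (e.g. a sketch-based counter) and shared state across collector replicas, but the filtering decision itself stays this simple.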
Module 4: Distributed Tracing and Service Observability
- Implementing trace context propagation across service boundaries using W3C Trace Context or OpenTelemetry standards.
- Sampling high-volume traces to reduce storage costs while preserving visibility into error paths and slow transactions.
- Integrating tracing into legacy systems that lack native context propagation support.
- Mapping trace data to service ownership for accurate SLI/SLO calculations and incident accountability.
- Filtering sensitive data (e.g., PII, tokens) from traces before export to comply with privacy regulations.
- Aligning span naming conventions across teams to enable consistent service map generation and dependency analysis.
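The W3C Trace Context propagation above centers on the `traceparent` header (`version-traceid-spanid-flags`). A sketch of parsing an incoming header and minting a child header that preserves the trace ID, assuming version `00` only:

```python
import re
import secrets

# traceparent: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, flags), or None if malformed."""
    m = TRACEPARENT.fullmatch(header)
    return m.groups() if m else None

def child_traceparent(header: str) -> str:
    """Carry the trace ID forward into a new child span's header."""
    parsed = parse_traceparent(header)
    if parsed is None:
        # No valid incoming context: start a new trace.
        trace_id = secrets.token_hex(16)
        flags = "01"
    else:
        trace_id, _, flags = parsed
    new_span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{new_span_id}-{flags}"
```

In practice an OpenTelemetry SDK handles this automatically; writing it by hand is mainly useful when retrofitting the legacy systems mentioned above.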
Module 5: Alerting and Incident Response
- Writing alerting rules that minimize false positives by incorporating duration, thresholds, and historical baselines.
- Routing alerts to on-call engineers using escalation policies in tools like PagerDuty or Opsgenie based on service criticality.
- Designing alert fatigue mitigation strategies, including alert grouping, throttling, and auto-resolution.
- Integrating runbook links and diagnostic commands directly into alert notifications for faster triage.
- Conducting blameless postmortems to update alerting rules based on incident root causes.
- Testing alert delivery and on-call workflows through scheduled fire drills and synthetic failures.
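The false-positive reduction described above (combining a threshold with a duration) can be sketched as an alert that fires only after N consecutive breaching samples, suppressing transient spikes. Threshold and window values are illustrative:

```python
class DurationAlert:
    """Fire only when the metric exceeds the threshold for
    `for_samples` consecutive observations."""

    def __init__(self, threshold: float, for_samples: int):
        self.threshold = threshold
        self.for_samples = for_samples
        self.breaches = 0  # current run of consecutive breaches

    def observe(self, value: float) -> bool:
        if value > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # any healthy sample resets the run
        return self.breaches >= self.for_samples
```

This mirrors the `for:` clause in Prometheus alerting rules: the condition must hold continuously for a window before the alert transitions to firing.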
Module 6: Monitoring in CI/CD and Pre-Production Environments
- Enabling monitoring in ephemeral environments (e.g., PR branches, staging) without incurring unnecessary costs.
- Using synthetic transactions to validate health and performance before promoting to production.
- Comparing performance metrics across environments to detect configuration drift or resource constraints.
- Blocking CI/CD pipelines based on SLO violations or critical alert triggers during canary deployments.
- Archiving monitoring data from short-lived environments for audit and debugging purposes.
- Securing access to pre-production monitoring dashboards to prevent exposure of test data.
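The canary-gating item above reduces to a pass/fail decision on observed error rate against the SLO, with a minimum sample size so a lucky early window cannot pass. All thresholds here are illustrative:

```python
def canary_gate(requests: int, errors: int,
                slo_error_rate: float = 0.001,
                min_requests: int = 500) -> bool:
    """Return True if the canary may be promoted.

    Requires a minimum traffic volume before judging, then compares the
    observed error rate against the SLO's allowed error rate.
    """
    if requests < min_requests:
        return False  # not enough traffic yet to make a call
    return (errors / requests) <= slo_error_rate
```

A CI/CD pipeline would poll the monitoring backend for `requests` and `errors` over the canary window and block promotion whenever this returns False.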
Module 7: Capacity Planning and Performance Benchmarking
- Using historical metric trends to forecast infrastructure scaling needs and justify budget requests.
- Establishing baseline performance profiles for services to detect degradation after deployments.
- Correlating monitoring data with cost allocation tools to identify underutilized or overprovisioned resources.
- Running load tests in production-like environments and comparing results against monitoring data.
- Setting up automated scaling policies based on real-time metrics (e.g., CPU, request queue depth).
- Documenting capacity thresholds and scaling procedures in shared runbooks for operational consistency.
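The trend-based forecasting above can be sketched as a least-squares line fitted to evenly spaced usage samples and extrapolated forward. A naive linear trend is only a starting point; real capacity models also need seasonality and growth-curve assumptions:

```python
def forecast_usage(samples, horizon):
    """Fit a least-squares line to evenly spaced usage samples and
    extrapolate `horizon` steps past the last observation."""
    n = len(samples)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + horizon)
```

Comparing the forecast against documented capacity thresholds gives a concrete, defensible number for the budget requests mentioned above.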
Module 8: Governance, Compliance, and Monitoring Maturity
- Defining and auditing monitoring coverage standards across business units to ensure regulatory compliance.
- Implementing role-based access control (RBAC) in monitoring platforms to restrict sensitive data exposure.
- Conducting quarterly reviews of alerting rules and dashboards to remove obsolete or redundant configurations.
- Measuring monitoring maturity using frameworks like the Observability Maturity Model (OMM).
- Integrating monitoring data into internal audit workflows for SOC 2, ISO 27001, or HIPAA compliance.
- Establishing cross-team feedback loops to refine monitoring practices based on incident reviews and user feedback.
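The RBAC item above can be reduced to a small policy check: roles map to the dashboard tags a user may view, and access requires some role that grants every tag on the dashboard. Role and tag names are purely illustrative:

```python
# Hypothetical role -> viewable dashboard tags mapping.
ROLE_GRANTS = {
    "viewer": {"public"},
    "sre": {"public", "infra", "alerts"},
    "security": {"public", "audit"},
}

def can_view(roles, dashboard_tags) -> bool:
    """True if any of the user's roles grants every tag on the dashboard."""
    required = set(dashboard_tags)
    return any(required <= ROLE_GRANTS.get(role, set()) for role in roles)
```

Centralizing the mapping like this also makes it auditable, which matters for the SOC 2 and ISO 27001 workflows listed above.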