This curriculum spans the technical, operational, and governance dimensions of outage management, comparable in scope to a multi-workshop program developed for enterprise SRE teams implementing availability controls across complex, distributed systems.
Module 1: Defining Availability Requirements and SLIs
- Selecting service-level indicators (SLIs) based on user-observable behaviors such as API response success rate, end-to-end transaction latency, and authentication success.
- Negotiating SLI definitions with product and SRE teams when backend dependencies obscure frontend reliability metrics.
- Determining appropriate measurement windows (e.g., 28-day rolling vs. calendar month) for SLA compliance reporting.
- Deciding whether to include retry traffic in error rate calculations, and assessing how that choice shifts perceived reliability.
- Implementing synthetic transactions to measure availability when real user monitoring (RUM) data is sparse or delayed.
- Handling edge cases where SLI breaches occur due to planned maintenance windows or third-party outages.
- Aligning SLI thresholds with business impact, such as revenue loss per minute of downtime.
- Designing SLI dashboards that differentiate between regional, global, and per-tenant availability.
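The rolling-window measurement from the bullets above can be sketched as a small aggregation. The function name `rolling_availability`, the day-indexed input shape, and the maintenance-exclusion policy are illustrative assumptions, not a standard API:

```python
def rolling_availability(daily_counts, window=28, exclude_days=frozenset()):
    """Compute an availability SLI over a rolling window.

    daily_counts: list of (day_index, good_requests, total_requests).
    window: number of most recent days to include (e.g. 28-day rolling).
    exclude_days: day indexes to skip, e.g. planned maintenance windows.
    """
    last_day = max(day for day, _, _ in daily_counts)
    good = total = 0
    for day, g, t in daily_counts:
        # Keep only days inside the rolling window and not excluded.
        if last_day - day < window and day not in exclude_days:
            good += g
            total += t
    # With no measurable traffic, report the SLI as met rather than failing.
    return good / total if total else 1.0
```

Excluding maintenance days, as sketched here, is a policy decision that should be stated explicitly in the SLA, since it changes the reported number without changing user experience.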
Module 2: Architecting for Fault Tolerance and Redundancy
- Distributing stateful services across availability zones while managing data consistency and replication lag.
- Choosing between active-passive and active-active failover models based on recovery time and cost constraints.
- Implementing circuit breakers in inter-service communication to prevent cascading failures during dependency outages.
- Designing retry policies with exponential backoff and jitter to avoid thundering herd problems.
- Validating failover automation through controlled chaos engineering experiments in production-like environments.
- Assessing the trade-off between data durability and availability in distributed databases during network partitions.
- Integrating health checks at multiple layers (network, application, business logic) to avoid false positives.
- Managing shared failure domains in cloud environments, such as hypervisor clusters or storage backplanes.
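The circuit-breaker bullet above can be sketched as a minimal state machine. The class name, thresholds, and injectable `clock` parameter are illustrative assumptions rather than any particular library's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures; after `reset_timeout` seconds it permits trial calls
    (half-open) until one succeeds."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed.

    def allow(self):
        if self.opened_at is None:
            return True  # Closed: all calls pass through.
        # Open: only allow a trial call once the reset timeout has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Production implementations usually add per-endpoint breakers, failure-rate (rather than consecutive-count) tripping, and metrics on state transitions; this sketch only shows the core open/half-open/closed cycle.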
Module 3: Monitoring, Alerting, and Incident Detection
- Configuring alert thresholds using SLO burn rate calculations instead of static error rate limits.
- Reducing alert fatigue by suppressing low-severity alerts during ongoing high-severity incidents.
- Correlating alerts across microservices to identify root cause signals versus symptom noise.
- Implementing dynamic thresholds for metrics with strong diurnal patterns to reduce false positives.
- Ensuring monitoring agents and collectors are resilient and do not become single points of failure.
- Validating that alerting rules trigger on user-impacting conditions, not just infrastructure anomalies.
- Integrating business KPIs (e.g., checkout success rate) into alerting systems to detect silent outages.
- Managing alert ownership and escalation paths across time zones in globally distributed teams.
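The burn-rate approach in the first bullet of this module can be sketched as below. The 14.4 threshold and the 1-hour/5-minute window pair follow the common multi-window, multi-burn-rate pattern for a 99.9% SLO; the function names are hypothetical:

```python
def burn_rate(error_rate, slo_target):
    """Burn rate = observed error rate / error budget (1 - SLO target).
    A burn rate of 1.0 consumes the budget exactly over the SLO period."""
    budget = 1.0 - slo_target
    return error_rate / budget

def should_page(error_rate_1h, error_rate_5m, slo_target=0.999, threshold=14.4):
    """Page only when BOTH a long and a short window exceed the burn-rate
    threshold: the long window proves sustained impact, the short window
    proves the problem is still happening."""
    return (burn_rate(error_rate_1h, slo_target) >= threshold
            and burn_rate(error_rate_5m, slo_target) >= threshold)
```

The short-window condition is what makes these alerts self-resolving: once mitigation lands, the 5-minute error rate drops and the page clears even though the 1-hour window is still elevated.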
Module 4: Incident Response and Outage Triage
- Activating incident response protocols based on predefined severity criteria tied to customer impact.
- Assigning incident commander roles and avoiding role duplication during multi-team outages.
- Documenting real-time incident timelines using structured event tagging (e.g., detection, mitigation, root cause).
- Deciding when to roll back a deployment versus applying a hotfix during an ongoing outage.
- Communicating outage status to internal stakeholders without speculating on root cause.
- Managing external communication through designated spokespersons during public-facing incidents.
- Using feature flags to isolate faulty components without full service rollback.
- Preserving logs, metrics, and traces from the time of incident for forensic analysis.
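The structured event tagging mentioned above can be sketched as a small timeline recorder; the tag vocabulary and class shape are illustrative assumptions:

```python
from dataclasses import dataclass, field

TAGS = frozenset({"detection", "mitigation", "root_cause", "resolution"})

@dataclass
class IncidentTimeline:
    """Append-only incident timeline with a fixed event-tag vocabulary."""
    events: list = field(default_factory=list)

    def record(self, ts, tag, note):
        """Record a timestamped event; reject tags outside the vocabulary
        so timelines stay machine-comparable across incidents."""
        if tag not in TAGS:
            raise ValueError(f"unknown event tag: {tag}")
        self.events.append((ts, tag, note))

    def first(self, tag):
        """Return the earliest event with the given tag, or None."""
        matches = [e for e in self.events if e[1] == tag]
        return min(matches) if matches else None
```

Constraining tags to a fixed vocabulary is what later enables fleet-wide queries such as "median time from detection to mitigation last quarter."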
Module 5: Root Cause Analysis and Post-Incident Review
- Conducting blameless postmortems that focus on systemic factors rather than individual mistakes.
- Using timeline reconstruction to identify detection latency and response delays in outage handling.
- Classifying contributing factors into categories such as design flaw, configuration drift, or monitoring gap.
- Prioritizing action items from postmortems based on recurrence risk and implementation effort.
- Tracking remediation tasks in issue trackers with ownership and deadlines to ensure follow-through.
- Sharing postmortem findings across engineering teams to prevent similar failures in other services.
- Deciding when to classify an incident as a security event requiring separate forensic investigation.
- Archiving postmortem documents in a searchable knowledge base for future reference.
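The timeline-reconstruction bullet above can be sketched as a latency calculation over tagged events; the tag names `impact_start`, `detection`, and `mitigation` are assumed conventions, not a standard:

```python
def response_latencies(events):
    """events: list of (timestamp_seconds, tag) tuples from an incident
    timeline. Returns detection latency (impact start -> detection) and
    mitigation latency (detection -> mitigation), in seconds."""
    by_tag = {}
    for ts, tag in events:
        by_tag.setdefault(tag, ts)  # First occurrence of each tag wins.
    return {
        "time_to_detect": by_tag["detection"] - by_tag["impact_start"],
        "time_to_mitigate": by_tag["mitigation"] - by_tag["detection"],
    }
```

Tracking these two numbers separately matters in review: a long time-to-detect points at monitoring gaps, while a long time-to-mitigate points at response-process or tooling gaps.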
Module 6: Change Management and Deployment Safety
- Requiring canary analysis before full rollout, comparing error rates and latency between canary and baseline.
- Blocking deployments during active incidents unless the change is part of the mitigation plan.
- Implementing deployment freezes during high-business-impact periods and defining rollback criteria.
- Validating infrastructure-as-code changes in staging environments that mirror production topology.
- Using feature flags with gradual rollouts to limit blast radius of new code paths.
- Enforcing peer review of configuration changes to critical systems such as load balancers and DNS.
- Automating pre-deployment checks for SLO compliance and dependency health.
- Tracking deployment metadata (who, what, when) in audit logs for incident correlation.
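A minimal canary comparison along the lines of the first bullet in this module might look like the following. The ratio guardrail and `min_requests` cutoff are simplifying assumptions; production canary analysis typically uses proper statistical tests rather than a raw ratio:

```python
def canary_verdict(canary_errors, canary_total, base_errors, base_total,
                   max_ratio=1.5, min_requests=100):
    """Fail the canary if its error rate exceeds the baseline's by more
    than `max_ratio`; abstain if either side has too little traffic to
    judge, rather than passing by default."""
    if canary_total < min_requests or base_total < min_requests:
        return "insufficient_data"
    canary_rate = canary_errors / canary_total
    base_rate = max(base_errors / base_total, 1e-9)  # Avoid divide-by-zero.
    return "fail" if canary_rate / base_rate > max_ratio else "pass"
```

Returning "insufficient_data" as a distinct outcome is deliberate: a low-traffic canary that silently passes is one of the most common ways bad rollouts reach full production.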
Module 7: Dependency and Third-Party Risk Management
- Mapping transitive dependencies to assess exposure from indirect third-party services.
- Negotiating SLAs with external vendors and defining remedies for non-compliance.
- Implementing client-side fallbacks or cached responses when external APIs degrade.
- Monitoring third-party endpoints with synthetic checks independent of production traffic.
- Designing integration points with timeouts, retries, and circuit breakers to limit dependency impact.
- Conducting business impact analysis when a critical vendor announces service deprecation.
- Validating failover to secondary providers when primary service becomes unreachable.
- Requiring contractual right-to-audit clauses for vendors managing core infrastructure components.
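The client-side fallback bullet above can be sketched as a thin wrapper around an external call; `fetch_with_fallback` and the dict-shaped cache are hypothetical, and a real client would bound cache staleness:

```python
def fetch_with_fallback(fetch, cache, key, timeout=2.0):
    """Try the external API; on any failure, serve the last cached value.

    fetch: callable(key, timeout=...) that may raise on error or timeout.
    cache: mapping of key -> last known-good response.
    Returns (value, source) where source is "live" or "cached".
    """
    try:
        value = fetch(key, timeout=timeout)
        cache[key] = value  # Refresh the cache on every successful call.
        return value, "live"
    except Exception:
        if key in cache:
            return cache[key], "cached"
        raise  # No fallback available: surface the dependency failure.
```

Surfacing the "cached" source to callers lets the rest of the system (and dashboards) distinguish genuine availability from degraded, stale-data operation.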
Module 8: Capacity Planning and Load Management
- Forecasting capacity needs using historical growth trends and upcoming product launches.
- Implementing autoscaling policies that respond to both traffic load and error rate increases.
- Simulating traffic spikes using load testing to validate scaling limits and bottleneck points.
- Setting concurrency limits on critical endpoints to prevent resource exhaustion.
- Managing cold start risks in serverless environments during sudden traffic surges.
- Deciding when to throttle non-critical traffic during overload to preserve core functionality.
- Right-sizing instance types based on memory, CPU, and I/O profiles of workloads.
- Coordinating capacity updates with dependent teams to avoid inter-service bottlenecks.
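The concurrency-limit and throttling bullets above can be sketched together. `ConcurrencyLimiter` and its `critical_reserve` headroom policy are illustrative assumptions; a real implementation would also need thread-safety:

```python
class ConcurrencyLimiter:
    """Reject requests beyond a fixed in-flight limit, reserving headroom
    so critical traffic is still admitted while non-critical traffic is
    shed under overload."""

    def __init__(self, limit, critical_reserve=0):
        self.limit = limit
        self.critical_reserve = critical_reserve
        self.in_flight = 0

    def try_acquire(self, critical=False):
        headroom = self.limit - self.in_flight
        # Non-critical requests must leave `critical_reserve` slots free.
        needed = 1 if critical else 1 + self.critical_reserve
        if headroom >= needed:
            self.in_flight += 1
            return True
        return False  # Caller should reject or queue the request.

    def release(self):
        self.in_flight -= 1
```

Fast rejection at admission is the point: refusing excess requests immediately keeps latency bounded for admitted traffic, whereas accepting everything exhausts resources and degrades every request at once.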
Module 9: Governance, Compliance, and Audit Readiness
- Documenting availability controls to meet regulatory requirements such as SOC 2 or ISO 27001.
- Producing availability reports for executives and board members using standardized templates.
- Aligning incident response procedures with organizational incident classification policies.
- Retaining monitoring data and incident logs for legally mandated retention periods.
- Conducting internal audits of SLO adherence and remediation completion rates.
- Managing access controls for outage-related systems (e.g., PagerDuty, monitoring tools) using least privilege.
- Reviewing third-party risk assessments annually for vendors with access to critical systems.
- Updating business continuity plans to reflect changes in system architecture and dependencies.