This curriculum spans the technical, operational, and governance dimensions of outage management, comparable in scope to a multi-workshop program developed for enterprise SRE teams implementing availability controls across complex, distributed systems.
Module 1: Defining Availability Requirements and SLIs
- Selecting service-level indicators (SLIs) based on user-observable behaviors such as API response success rate, end-to-end transaction latency, and authentication success.
- Negotiating SLI definitions with product and SRE teams when backend dependencies obscure frontend reliability metrics.
- Determining appropriate measurement windows (e.g., 28-day rolling vs. calendar month) for SLA compliance reporting.
- Deciding whether to include retry traffic in error rate calculations, and assessing how that choice shifts perceived reliability.
- Implementing synthetic transactions to measure availability when real user monitoring (RUM) data is sparse or delayed.
- Handling edge cases where SLI breaches occur due to planned maintenance windows or third-party outages.
- Aligning SLI thresholds with business impact, such as revenue loss per minute of downtime.
- Designing SLI dashboards that differentiate between regional, global, and per-tenant availability.
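The rolling-window measurement from the bullets above can be sketched as a small aggregation. The function name `rolling_availability`, the day-indexed input shape, and the maintenance-exclusion policy are illustrative assumptions, not a standard API:

```python
def rolling_availability(daily_counts, window=28, exclude_days=frozenset()):
    """Compute an availability SLI over a rolling window.

    daily_counts: list of (day_index, good_requests, total_requests).
    window: number of most recent days to include (e.g. 28-day rolling).
    exclude_days: day indexes to skip, e.g. planned maintenance windows.
    """
    last_day = max(day for day, _, _ in daily_counts)
    good = total = 0
    for day, g, t in daily_counts:
        # Keep only days inside the rolling window and not excluded.
        if last_day - day < window and day not in exclude_days:
            good += g
            total += t
    # With no measurable traffic, report the SLI as met rather than failing.
    return good / total if total else 1.0
```

Excluding maintenance days, as sketched here, is a policy decision that should be stated explicitly in the SLA, since it changes the reported number without changing user experience.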
Module 2: Architecting for Fault Tolerance and Redundancy
- Distributing stateful services across availability zones while managing data consistency and replication lag.
- Choosing between active-passive and active-active failover models based on recovery time and cost constraints.
- Implementing circuit breakers in inter-service communication to prevent cascading failures during dependency outages.
- Designing retry policies with exponential backoff and jitter to avoid thundering herd problems.
- Validating failover automation through controlled chaos engineering experiments in production-like environments.
- Assessing the trade-off between data durability and availability in distributed databases during network partitions.
- Integrating health checks at multiple layers (network, application, business logic) to avoid false positives.
- Managing shared failure domains in cloud environments, such as hypervisor clusters or storage backplanes.
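The circuit-breaker bullet above can be sketched as a minimal state machine. The class name, thresholds, and injectable `clock` parameter are illustrative assumptions rather than any particular library's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures; after `reset_timeout` seconds it permits trial calls
    (half-open) until one succeeds."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed.

    def allow(self):
        if self.opened_at is None:
            return True  # Closed: all calls pass through.
        # Open: only allow a trial call once the reset timeout has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Production implementations usually add per-endpoint breakers, failure-rate (rather than consecutive-count) tripping, and metrics on state transitions; this sketch only shows the core open/half-open/closed cycle.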
Module 3: Monitoring, Alerting, and Incident Detection
- Configuring alert thresholds using SLO burn rate calculations instead of static error rate limits.
- Reducing alert fatigue by suppressing low-severity alerts during ongoing high-severity incidents.
- Correlating alerts across microservices to identify root cause signals versus symptom noise.
- Implementing dynamic thresholds for metrics with strong diurnal patterns to reduce false positives.
- Ensuring monitoring agents and collectors are resilient and do not become single points of failure.
- Validating that alerting rules trigger on user-impacting conditions, not just infrastructure anomalies.
- Integrating business KPIs (e.g., checkout success rate) into alerting systems to detect silent outages.
- Managing alert ownership and escalation paths across time zones in globally distributed teams.
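The burn-rate approach in the first bullet of this module can be sketched as below. The 14.4 threshold and the 1-hour/5-minute window pair follow the common multi-window, multi-burn-rate pattern for a 99.9% SLO; the function names are hypothetical:

```python
def burn_rate(error_rate, slo_target):
    """Burn rate = observed error rate / error budget (1 - SLO target).
    A burn rate of 1.0 consumes the budget exactly over the SLO period."""
    budget = 1.0 - slo_target
    return error_rate / budget

def should_page(error_rate_1h, error_rate_5m, slo_target=0.999, threshold=14.4):
    """Page only when BOTH a long and a short window exceed the burn-rate
    threshold: the long window proves sustained impact, the short window
    proves the problem is still happening."""
    return (burn_rate(error_rate_1h, slo_target) >= threshold
            and burn_rate(error_rate_5m, slo_target) >= threshold)
```

The short-window condition is what makes these alerts self-resolving: once mitigation lands, the 5-minute error rate drops and the page clears even though the 1-hour window is still elevated.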
Module 4: Incident Response and Outage Triage
- Activating incident response protocols based on predefined severity criteria tied to customer impact.
- Assigning incident commander roles and avoiding role duplication during multi-team outages.
- Documenting real-time incident timelines using structured event tagging (e.g., detection, mitigation, root cause).
- Deciding when to roll back a deployment versus applying a hotfix during an ongoing outage.
- Communicating outage status to internal stakeholders without speculating on root cause.
- Managing external communication through designated spokespersons during public-facing incidents.
- Using feature flags to isolate faulty components without full service rollback.
- Preserving logs, metrics, and traces from the time of incident for forensic analysis.
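The structured event tagging mentioned above can be sketched as a small timeline recorder; the tag vocabulary and class shape are illustrative assumptions:

```python
from dataclasses import dataclass, field

TAGS = frozenset({"detection", "mitigation", "root_cause", "resolution"})

@dataclass
class IncidentTimeline:
    """Append-only incident timeline with a fixed event-tag vocabulary."""
    events: list = field(default_factory=list)

    def record(self, ts, tag, note):
        """Record a timestamped event; reject tags outside the vocabulary
        so timelines stay machine-comparable across incidents."""
        if tag not in TAGS:
            raise ValueError(f"unknown event tag: {tag}")
        self.events.append((ts, tag, note))

    def first(self, tag):
        """Return the earliest event with the given tag, or None."""
        matches = [e for e in self.events if e[1] == tag]
        return min(matches) if matches else None
```

Constraining tags to a fixed vocabulary is what later enables fleet-wide queries such as "median time from detection to mitigation last quarter."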
Module 5: Root Cause Analysis and Post-Incident Review
- Conducting blameless postmortems that focus on systemic factors rather than individual mistakes.
- Using timeline reconstruction to identify detection latency and response delays in outage handling.
- Classifying contributing factors into categories such as design flaw, configuration drift, or monitoring gap.
- Prioritizing action items from postmortems based on recurrence risk and implementation effort.
- Tracking remediation tasks in issue trackers with ownership and deadlines to ensure follow-through.
- Sharing postmortem findings across engineering teams to prevent similar failures in other services.
- Deciding when to classify an incident as a security event requiring separate forensic investigation.
- Archiving postmortem documents in a searchable knowledge base for future reference.
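The timeline-reconstruction bullet above can be sketched as a latency calculation over tagged events; the tag names `impact_start`, `detection`, and `mitigation` are assumed conventions, not a standard:

```python
def response_latencies(events):
    """events: list of (timestamp_seconds, tag) tuples from an incident
    timeline. Returns detection latency (impact start -> detection) and
    mitigation latency (detection -> mitigation), in seconds."""
    by_tag = {}
    for ts, tag in events:
        by_tag.setdefault(tag, ts)  # First occurrence of each tag wins.
    return {
        "time_to_detect": by_tag["detection"] - by_tag["impact_start"],
        "time_to_mitigate": by_tag["mitigation"] - by_tag["detection"],
    }
```

Tracking these two numbers separately matters in review: a long time-to-detect points at monitoring gaps, while a long time-to-mitigate points at response-process or tooling gaps.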
Module 6: Change Management and Deployment Safety
- Requiring canary analysis before full rollout, comparing error rates and latency between canary and baseline.
- Blocking deployments during active incidents unless the change is part of the mitigation plan.
- Implementing deployment freezes during high-business-impact periods and defining rollback criteria.
- Validating infrastructure-as-code changes in staging environments that mirror production topology.
- Using feature flags with gradual rollouts to limit blast radius of new code paths.
- Enforcing peer review of configuration changes to critical systems such as load balancers and DNS.
- Automating pre-deployment checks for SLO compliance and dependency health.
- Tracking deployment metadata (who, what, when) in audit logs for incident correlation.
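A minimal canary comparison along the lines of the first bullet in this module might look like the following. The ratio guardrail and `min_requests` cutoff are simplifying assumptions; production canary analysis typically uses proper statistical tests rather than a raw ratio:

```python
def canary_verdict(canary_errors, canary_total, base_errors, base_total,
                   max_ratio=1.5, min_requests=100):
    """Fail the canary if its error rate exceeds the baseline's by more
    than `max_ratio`; abstain if either side has too little traffic to
    judge, rather than passing by default."""
    if canary_total < min_requests or base_total < min_requests:
        return "insufficient_data"
    canary_rate = canary_errors / canary_total
    base_rate = max(base_errors / base_total, 1e-9)  # Avoid divide-by-zero.
    return "fail" if canary_rate / base_rate > max_ratio else "pass"
```

Returning "insufficient_data" as a distinct outcome is deliberate: a low-traffic canary that silently passes is one of the most common ways bad rollouts reach full production.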
Module 7: Dependency and Third-Party Risk Management
- Mapping transitive dependencies to assess exposure from indirect third-party services.
- Negotiating SLAs with external vendors and defining remedies for non-compliance.
- Implementing client-side fallbacks or cached responses when external APIs degrade.
- Monitoring third-party endpoints with synthetic checks independent of production traffic.
- Designing integration points with timeouts, retries, and circuit breakers to limit dependency impact.
- Conducting business impact analysis when a critical vendor announces service deprecation.
- Validating failover to secondary providers when primary service becomes unreachable.
- Requiring contractual right-to-audit clauses for vendors managing core infrastructure components.
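The client-side fallback bullet above can be sketched as a thin wrapper around an external call; `fetch_with_fallback` and the dict-shaped cache are hypothetical, and a real client would bound cache staleness:

```python
def fetch_with_fallback(fetch, cache, key, timeout=2.0):
    """Try the external API; on any failure, serve the last cached value.

    fetch: callable(key, timeout=...) that may raise on error or timeout.
    cache: mapping of key -> last known-good response.
    Returns (value, source) where source is "live" or "cached".
    """
    try:
        value = fetch(key, timeout=timeout)
        cache[key] = value  # Refresh the cache on every successful call.
        return value, "live"
    except Exception:
        if key in cache:
            return cache[key], "cached"
        raise  # No fallback available: surface the dependency failure.
```

Surfacing the "cached" source to callers lets the rest of the system (and dashboards) distinguish genuine availability from degraded, stale-data operation.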
Module 8: Capacity Planning and Load Management
- Forecasting capacity needs using historical growth trends and upcoming product launches.
- Implementing autoscaling policies that respond to both traffic load and error rate increases.
- Simulating traffic spikes using load testing to validate scaling limits and bottleneck points.
- Setting concurrency limits on critical endpoints to prevent resource exhaustion.
- Managing cold start risks in serverless environments during sudden traffic surges.
- Deciding when to throttle non-critical traffic during overload to preserve core functionality.
- Right-sizing instance types based on memory, CPU, and I/O profiles of workloads.
- Coordinating capacity updates with dependent teams to avoid inter-service bottlenecks.
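The concurrency-limit and throttling bullets above can be sketched together. `ConcurrencyLimiter` and its `critical_reserve` headroom policy are illustrative assumptions; a real implementation would also need thread-safety:

```python
class ConcurrencyLimiter:
    """Reject requests beyond a fixed in-flight limit, reserving headroom
    so critical traffic is still admitted while non-critical traffic is
    shed under overload."""

    def __init__(self, limit, critical_reserve=0):
        self.limit = limit
        self.critical_reserve = critical_reserve
        self.in_flight = 0

    def try_acquire(self, critical=False):
        headroom = self.limit - self.in_flight
        # Non-critical requests must leave `critical_reserve` slots free.
        needed = 1 if critical else 1 + self.critical_reserve
        if headroom >= needed:
            self.in_flight += 1
            return True
        return False  # Caller should reject or queue the request.

    def release(self):
        self.in_flight -= 1
```

Fast rejection at admission is the point: refusing excess requests immediately keeps latency bounded for admitted traffic, whereas accepting everything exhausts resources and degrades every request at once.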
Module 9: Governance, Compliance, and Audit Readiness
- Documenting availability controls to meet regulatory requirements such as SOC 2 or ISO 27001.
- Producing availability reports for executives and board members using standardized templates.
- Aligning incident response procedures with organizational incident classification policies.
- Retaining monitoring data and incident logs for legally mandated retention periods.
- Conducting internal audits of SLO adherence and remediation completion rates.
- Managing access controls for outage-related systems (e.g., PagerDuty, monitoring tools) using least privilege.
- Reviewing third-party risk assessments annually for vendors with access to critical systems.
- Updating business continuity plans to reflect changes in system architecture and dependencies.