This curriculum spans the breadth of a multi-workshop operational resilience program, covering the technical, procedural, and governance practices required to maintain service availability across complex, distributed systems.
Module 1: Defining and Measuring Service Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on service criticality and business SLAs
- Distinguishing between system uptime and user-perceived availability in distributed architectures
- Implementing synthetic transaction monitoring to simulate end-user workflows and detect functional outages
- Calculating availability over meaningful time windows that align with business operation cycles
- Integrating availability data from multiple monitoring tools into a unified reporting dashboard
- Handling clock skew and time synchronization issues when aggregating logs across global data centers
- Adjusting availability calculations during planned maintenance windows without artificially inflating availability figures
- Defining escalation thresholds that trigger incident management based on sustained degradation, not just outages
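The availability calculations above can be sketched in a few lines. This is a minimal illustration (function names are my own, not from any monitoring product): one helper derives steady-state availability from MTBF and MTTR, and another computes availability over a reporting window while excluding planned maintenance from the denominator, so scheduled work neither counts as an outage nor pads the result.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def windowed_availability(window_s: float, downtime_s: float,
                          maintenance_s: float = 0.0) -> float:
    """Availability over a reporting window.

    Planned maintenance is subtracted from the denominator so it is
    neither counted as downtime nor used to inflate the percentage.
    """
    effective = window_s - maintenance_s
    return (effective - downtime_s) / effective

# A service failing once every 999 hours with a 1-hour repair time:
print(availability_from_mtbf(999, 1))  # 0.999
```

In practice the window boundaries should align with the business cycles mentioned above (e.g., a billing month), not arbitrary calendar slices.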
Module 2: High Availability Architecture Design
- Distributing stateless services across multiple availability zones while managing failover latency
- Designing stateful systems with replicated data stores and consensus algorithms (e.g., Raft, Paxos) for fault tolerance
- Selecting active-active vs. active-passive configurations based on cost, complexity, and recovery time requirements
- Implementing health checks that accurately reflect service readiness without causing cascading failures
- Configuring load balancer stickiness and session persistence in multi-region deployments
- Managing DNS TTL values to balance responsiveness during failover with caching efficiency
- Designing retry mechanisms with exponential backoff and jitter to prevent thundering herd problems
- Validating failover procedures through controlled chaos engineering experiments
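The retry bullet above can be made concrete. Below is a sketch of capped exponential backoff with "full jitter" (a uniform random sleep up to the backoff ceiling); the defaults and the injectable `rng`/`sleep` hooks are illustrative choices, not a prescribed implementation.

```python
import random
import time

def call_with_backoff(fn, attempts=5, base=0.1, cap=30.0,
                      rng=random.random, sleep=time.sleep):
    """Retry fn with capped exponential backoff and full jitter.

    Sleeping a uniform random amount in [0, min(cap, base * 2**attempt)]
    spreads retries out so that many clients recovering at the same
    moment do not stampede the dependency (the thundering herd problem).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Injecting `rng` and `sleep` keeps the policy deterministic under test, which matters when validating failover behavior in chaos experiments.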
Module 3: Incident Management and Outage Response
- Establishing clear incident command roles during major outages to avoid decision paralysis
- Using runbooks that distinguish between diagnostic steps and irreversible remediation actions
- Coordinating communication between engineering, operations, and customer support during extended incidents
- Implementing circuit-breaking patterns to isolate failing dependencies and preserve core functionality
- Deciding when to roll back a deployment versus applying a hotfix during an ongoing incident
- Preserving forensic data (logs, metrics, core dumps) before restarting or terminating affected components
- Managing access to production systems during incidents without compromising security controls
- Conducting real-time blameless triage while maintaining audit trails for post-incident review
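The circuit-breaking pattern above can be sketched as a small state machine. This is a deliberately minimal version (thresholds and the injectable clock are illustrative): after a run of consecutive failures the circuit opens and calls fail fast, and after a cooldown one probe call is allowed through.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed.

    After `threshold` consecutive failures the circuit opens and calls
    fail fast, shedding load from the failing dependency. After
    `reset_timeout` seconds one probe call is allowed (half-open); a
    success closes the circuit, a failure reopens it.
    """

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Isolating the dependency this way preserves core functionality while responders work the incident, instead of letting timeouts cascade upward.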
Module 4: Change Management and Deployment Safety
- Requiring mandatory peer review and automated testing gates before promoting changes to production
- Implementing canary deployments with traffic shifting based on health and performance metrics
- Using feature flags to decouple deployment from release and enable rapid disablement of problematic functionality
- Enforcing deployment blackouts during peak business hours or critical operations
- Validating rollback procedures in staging to ensure they function under realistic failure conditions
- Tracking configuration drift between environments using infrastructure-as-code diffs
- Requiring pre-change impact assessments that explicitly address availability risks
- Automating pre-deployment checks for capacity headroom and dependency health
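The feature-flag bullet above lends itself to a short sketch. A common approach (shown here with illustrative names, not a specific flag product's API) is a deterministic percentage rollout: hash the flag and user into a stable bucket, and enable the feature for buckets below the rollout percentage.

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministic percentage rollout for a feature flag.

    Hashing flag + user into a stable 0-99 bucket means a given user
    keeps the same experience as the percentage ramps up, and setting
    rollout_pct to 0 disables the feature instantly without a redeploy,
    decoupling deployment from release.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_pct
```

The same bucketing signal can drive canary traffic shifting: ramp `rollout_pct` only while the health and performance metrics for the canary cohort stay within bounds.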
Module 5: Dependency and Supply Chain Resilience
- Mapping direct and transitive dependencies to identify single points of failure in third-party services
- Negotiating SLAs with external vendors that include meaningful penalties and exit clauses
- Implementing local caching and fallback modes for critical external APIs with known instability
- Monitoring upstream service health independently of vendor-provided status pages
- Managing software supply chain risks by signing and verifying artifacts in the CI/CD pipeline
- Architecting multi-homing strategies for cloud providers in geographies where regional outages are frequent
- Conducting regular dependency audits to remove unused or unmaintained libraries
- Establishing fallback communication channels when primary collaboration tools fail
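The caching-and-fallback bullet above can be sketched as a stale-on-error cache (class and method names are illustrative): serve fresh data within the TTL, refresh when stale, and fall back to the last known-good value if the refresh fails, so a dependency outage degrades to stale data rather than an error.

```python
import time

class FallbackCache:
    """Cache for a flaky external API with stale-on-error fallback."""

    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._data = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        entry = self._data.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl_s:
            return entry[0]  # still fresh: no upstream call at all
        try:
            value = fetch()
        except Exception:
            if entry is not None:
                return entry[0]  # stale data beats failing outright
            raise  # nothing cached yet: the error must surface
        self._data[key] = (value, self.clock())
        return value
```

Whether stale data is acceptable, and for how long, is a per-API decision; for some dependencies (e.g., authorization) failing closed is the safer degradation.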
Module 6: Capacity Planning and Performance Engineering
- Forecasting capacity needs using historical growth trends and business roadmap inputs
- Conducting load testing with production-like data volumes and access patterns
- Identifying performance bottlenecks through distributed tracing and queue latency analysis
- Setting autoscaling policies that respond to meaningful signals without oscillation
- Right-sizing virtual machines and containers based on actual utilization, not peak observed load
- Managing cold start issues in serverless environments during sudden traffic spikes
- Reserving capacity for critical services in shared environments to prevent resource starvation
- Validating that backup and recovery workloads do not overload primary systems during testing
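The autoscaling bullet above can be illustrated with a target-tracking decision in the spirit of the Kubernetes HPA formula (the tolerance value and limits here are illustrative): a deadband around the setpoint is what keeps a noisy signal from causing oscillation.

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Target-tracking scaling decision with a deadband.

    If the observed metric is within `tolerance` of the target, hold the
    current replica count -- the deadband stops the scaler from flapping
    around the setpoint. Otherwise scale proportionally, clamped to the
    configured bounds.
    """
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # within deadband: no change
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))
```

Production policies usually add separate scale-in stabilization (e.g., a cooldown) on top of this, since scaling in too eagerly is the more dangerous direction.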
Module 7: Monitoring, Alerting, and Observability
- Defining alerting thresholds based on SLO error budgets rather than arbitrary metric limits
- Reducing alert fatigue by suppressing low-severity alerts during major incidents
- Correlating events across logs, metrics, and traces to identify root causes faster
- Implementing dynamic baselining to detect anomalies in seasonal or variable workloads
- Ensuring monitoring systems themselves are highly available and independently deployed
- Managing retention policies for telemetry data to balance cost and forensic needs
- Using service-level indicators to validate that monitoring reflects actual user experience
- Securing access to observability tools with role-based permissions and audit logging
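The first bullet above (alerting on SLO error budgets rather than raw thresholds) reduces to a burn-rate calculation. The sketch below uses the common 14.4x fast-burn threshold as an example; the exact multiplier is a policy choice, not a fixed rule.

```python
def burn_rate(errors: int, total: int, slo: float = 0.999) -> float:
    """How fast the error budget is being consumed.

    1.0 means errors arrive at exactly the rate the SLO allows over the
    budget period; 10.0 means the budget burns ten times too fast.
    """
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)

def should_page(errors: int, total: int, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when the burn rate crosses a threshold.

    14.4 is a common fast-burn choice (it consumes roughly 2% of a
    30-day budget in one hour); slower burns are better handled as
    tickets than pages.
    """
    return burn_rate(errors, total, slo) >= threshold
```

Pairing a fast window with a slow confirmation window (multiwindow alerting) further cuts false pages, at the cost of slightly slower detection.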
Module 8: Disaster Recovery and Business Continuity
- Classifying systems by recovery time and point objectives to allocate appropriate DR resources
- Testing full failover to secondary sites with real traffic redirection, not just connectivity checks
- Validating data consistency and integrity after failback from a disaster recovery site
- Storing offline backups in geographically isolated locations with physical access controls
- Documenting manual recovery procedures for systems that cannot be fully automated
- Coordinating DR testing with business units to minimize disruption to live operations
- Maintaining up-to-date contact lists and access credentials for emergency responders
- Reviewing insurance coverage and regulatory obligations related to prolonged outages
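The first bullet above, classifying systems by RTO and RPO, is often implemented as a simple tiering function so that replication, backup cadence, and test frequency can be budgeted per tier rather than per system. The thresholds below are purely illustrative; real values must be set with the business.

```python
def dr_tier(rto_minutes: float, rpo_minutes: float) -> str:
    """Map recovery time / recovery point objectives to a DR tier.

    Tighter objectives demand costlier mechanisms, so a system must meet
    BOTH bounds to qualify for a tier. Threshold values are examples.
    """
    if rto_minutes <= 15 and rpo_minutes <= 5:
        return "tier-0"   # continuous replication, automated failover
    if rto_minutes <= 240 and rpo_minutes <= 60:
        return "tier-1"   # warm standby, frequent snapshots
    if rto_minutes <= 1440 and rpo_minutes <= 720:
        return "tier-2"   # restore from backup within a day
    return "tier-3"       # best effort / rebuild from documentation
```

Note the AND in each condition: a system that needs a 10-minute RTO but tolerates hours of data loss still lands in a lower tier than one needing both bounds tight, which is usually the intended cost behavior.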
Module 9: Governance, Compliance, and Continuous Improvement
- Conducting post-incident reviews with mandatory action item tracking and closure verification
- Aligning availability practices with regulatory requirements (e.g., GDPR, HIPAA, SOX)
- Auditing change logs and access controls to detect policy violations or unauthorized modifications
- Integrating availability KPIs into executive reporting and board-level risk assessments
- Updating runbooks and playbooks based on lessons learned from real incidents
- Revising SLOs and error budgets in response to changing business priorities
- Enforcing configuration standards through automated policy-as-code tools
- Rotating on-call responsibilities to prevent burnout while maintaining team readiness
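The policy-as-code bullet above can be illustrated with a toy enforcement check. Real policy engines (OPA/Rego, for example) express far richer rules; this sketch only shows the enforce-by-diff idea, with hypothetical setting names.

```python
def policy_violations(config: dict, policy: dict) -> list[str]:
    """Report every config setting that is missing or differs from policy.

    `policy` maps a setting name to the value it must have. Returning a
    list of human-readable violations (rather than a bool) supports both
    CI gating and audit reporting from the same check.
    """
    violations = []
    for key, required in policy.items():
        actual = config.get(key, "<missing>")
        if actual != required:
            violations.append(f"{key}: expected {required!r}, got {actual!r}")
    return violations
```

Run in CI, an empty result gates the change through; a non-empty result both blocks the deploy and feeds the audit trail described above.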