This curriculum spans the breadth of a multi-workshop operational resilience program, covering the technical, procedural, and governance practices required to maintain service availability across complex, distributed systems.
Module 1: Defining and Measuring Service Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on service criticality and business SLAs
- Distinguishing between system uptime and user-perceived availability in distributed architectures
- Implementing synthetic transaction monitoring to simulate end-user workflows and detect functional outages
- Calculating availability over meaningful time windows that align with business operation cycles
- Integrating availability data from multiple monitoring tools into a unified reporting dashboard
- Handling clock skew and time synchronization issues when aggregating logs across global data centers
- Adjusting availability calculations during planned maintenance windows without artificially inflating availability figures
- Defining escalation thresholds that trigger incident management based on sustained degradation, not just outages
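The availability calculations above can be sketched in a few lines. This is a minimal illustration (function names are my own, not from any monitoring product): one helper derives steady-state availability from MTBF and MTTR, and another computes availability over a reporting window while excluding planned maintenance from the denominator, so scheduled work neither counts as an outage nor pads the result.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def windowed_availability(window_s: float, downtime_s: float,
                          maintenance_s: float = 0.0) -> float:
    """Availability over a reporting window.

    Planned maintenance is subtracted from the denominator so it is
    neither counted as downtime nor used to inflate the percentage.
    """
    effective = window_s - maintenance_s
    return (effective - downtime_s) / effective

# A service failing once every 999 hours with a 1-hour repair time:
print(availability_from_mtbf(999, 1))  # 0.999
```

In practice the window boundaries should align with the business cycles mentioned above (e.g., a billing month), not arbitrary calendar slices.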
Module 2: High Availability Architecture Design
- Distributing stateless services across multiple availability zones while managing failover latency
- Designing stateful systems with replicated data stores and consensus algorithms (e.g., Raft, Paxos) for fault tolerance
- Selecting active-active vs. active-passive configurations based on cost, complexity, and recovery time requirements
- Implementing health checks that accurately reflect service readiness without causing cascading failures
- Configuring load balancer stickiness and session persistence in multi-region deployments
- Managing DNS TTL values to balance responsiveness during failover with caching efficiency
- Designing retry mechanisms with exponential backoff and jitter to prevent thundering herd problems
- Validating failover procedures through controlled chaos engineering experiments
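The retry bullet above can be made concrete. Below is a sketch of capped exponential backoff with "full jitter" (a uniform random sleep up to the backoff ceiling); the defaults and the injectable `rng`/`sleep` hooks are illustrative choices, not a prescribed implementation.

```python
import random
import time

def call_with_backoff(fn, attempts=5, base=0.1, cap=30.0,
                      rng=random.random, sleep=time.sleep):
    """Retry fn with capped exponential backoff and full jitter.

    Sleeping a uniform random amount in [0, min(cap, base * 2**attempt)]
    spreads retries out so that many clients recovering at the same
    moment do not stampede the dependency (the thundering herd problem).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Injecting `rng` and `sleep` keeps the policy deterministic under test, which matters when validating failover behavior in chaos experiments.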
Module 3: Incident Management and Outage Response
- Establishing clear incident command roles during major outages to avoid decision paralysis
- Using runbooks that distinguish between diagnostic steps and irreversible remediation actions
- Coordinating communication between engineering, operations, and customer support during extended incidents
- Implementing circuit-breaking patterns to isolate failing dependencies and preserve core functionality
- Deciding when to roll back a deployment versus applying a hotfix during an ongoing incident
- Preserving forensic data (logs, metrics, core dumps) before restarting or terminating affected components
- Managing access to production systems during incidents without compromising security controls
- Conducting real-time blameless triage while maintaining audit trails for post-incident review
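The circuit-breaking pattern above can be sketched as a small state machine. This is a deliberately minimal version (thresholds and the injectable clock are illustrative): after a run of consecutive failures the circuit opens and calls fail fast, and after a cooldown one probe call is allowed through.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed.

    After `threshold` consecutive failures the circuit opens and calls
    fail fast, shedding load from the failing dependency. After
    `reset_timeout` seconds one probe call is allowed (half-open); a
    success closes the circuit, a failure reopens it.
    """

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Isolating the dependency this way preserves core functionality while responders work the incident, instead of letting timeouts cascade upward.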
Module 4: Change Management and Deployment Safety
- Requiring mandatory peer review and automated testing gates before promoting changes to production
- Implementing canary deployments with traffic shifting based on health and performance metrics
- Using feature flags to decouple deployment from release and enable rapid disablement of problematic functionality
- Enforcing deployment blackouts during peak business hours or critical operations
- Validating rollback procedures in staging to ensure they function under realistic failure conditions
- Tracking configuration drift between environments using infrastructure-as-code diffs
- Requiring pre-change impact assessments that explicitly address availability risks
- Automating pre-deployment checks for capacity headroom and dependency health
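The feature-flag bullet above lends itself to a short sketch. A common approach (shown here with illustrative names, not a specific flag product's API) is a deterministic percentage rollout: hash the flag and user into a stable bucket, and enable the feature for buckets below the rollout percentage.

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministic percentage rollout for a feature flag.

    Hashing flag + user into a stable 0-99 bucket means a given user
    keeps the same experience as the percentage ramps up, and setting
    rollout_pct to 0 disables the feature instantly without a redeploy,
    decoupling deployment from release.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_pct
```

The same bucketing signal can drive canary traffic shifting: ramp `rollout_pct` only while the health and performance metrics for the canary cohort stay within bounds.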
Module 5: Dependency and Supply Chain Resilience
- Mapping direct and transitive dependencies to identify single points of failure in third-party services
- Negotiating SLAs with external vendors that include meaningful penalties and exit clauses
- Implementing local caching and fallback modes for critical external APIs with known instability
- Monitoring upstream service health independently of vendor-provided status pages
- Managing software supply chain risks by signing and verifying artifacts in the CI/CD pipeline
- Architecting multi-homing strategies for cloud providers in geographies where regional outages are frequent
- Conducting regular dependency audits to remove unused or unmaintained libraries
- Establishing fallback communication channels when primary collaboration tools fail
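The caching-and-fallback bullet above can be sketched as a stale-on-error cache (class and method names are illustrative): serve fresh data within the TTL, refresh when stale, and fall back to the last known-good value if the refresh fails, so a dependency outage degrades to stale data rather than an error.

```python
import time

class FallbackCache:
    """Cache for a flaky external API with stale-on-error fallback."""

    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._data = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        entry = self._data.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl_s:
            return entry[0]  # still fresh: no upstream call at all
        try:
            value = fetch()
        except Exception:
            if entry is not None:
                return entry[0]  # stale data beats failing outright
            raise  # nothing cached yet: the error must surface
        self._data[key] = (value, self.clock())
        return value
```

Whether stale data is acceptable, and for how long, is a per-API decision; for some dependencies (e.g., authorization) failing closed is the safer degradation.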
Module 6: Capacity Planning and Performance Engineering
- Forecasting capacity needs using historical growth trends and business roadmap inputs
- Conducting load testing with production-like data volumes and access patterns
- Identifying performance bottlenecks through distributed tracing and queue latency analysis
- Setting autoscaling policies that respond to meaningful signals without oscillation
- Right-sizing virtual machines and containers based on actual utilization, not peak observed load
- Managing cold start issues in serverless environments during sudden traffic spikes
- Reserving capacity for critical services in shared environments to prevent resource starvation
- Validating that backup and recovery workloads do not overload primary systems during testing
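The autoscaling bullet above can be illustrated with a target-tracking decision in the spirit of the Kubernetes HPA formula (the tolerance value and limits here are illustrative): a deadband around the setpoint is what keeps a noisy signal from causing oscillation.

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Target-tracking scaling decision with a deadband.

    If the observed metric is within `tolerance` of the target, hold the
    current replica count -- the deadband stops the scaler from flapping
    around the setpoint. Otherwise scale proportionally, clamped to the
    configured bounds.
    """
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # within deadband: no change
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))
```

Production policies usually add separate scale-in stabilization (e.g., a cooldown) on top of this, since scaling in too eagerly is the more dangerous direction.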
Module 7: Monitoring, Alerting, and Observability
- Defining alerting thresholds based on SLO error budgets rather than arbitrary metric limits
- Reducing alert fatigue by suppressing low-severity alerts during major incidents
- Correlating events across logs, metrics, and traces to identify root causes faster
- Implementing dynamic baselining to detect anomalies in seasonal or variable workloads
- Ensuring monitoring systems themselves are highly available and independently deployed
- Managing retention policies for telemetry data to balance cost and forensic needs
- Using service-level indicators to validate that monitoring reflects actual user experience
- Securing access to observability tools with role-based permissions and audit logging
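The first bullet above (alerting on SLO error budgets rather than raw thresholds) reduces to a burn-rate calculation. The sketch below uses the common 14.4x fast-burn threshold as an example; the exact multiplier is a policy choice, not a fixed rule.

```python
def burn_rate(errors: int, total: int, slo: float = 0.999) -> float:
    """How fast the error budget is being consumed.

    1.0 means errors arrive at exactly the rate the SLO allows over the
    budget period; 10.0 means the budget burns ten times too fast.
    """
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)

def should_page(errors: int, total: int, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when the burn rate crosses a threshold.

    14.4 is a common fast-burn choice (it consumes roughly 2% of a
    30-day budget in one hour); slower burns are better handled as
    tickets than pages.
    """
    return burn_rate(errors, total, slo) >= threshold
```

Pairing a fast window with a slow confirmation window (multiwindow alerting) further cuts false pages, at the cost of slightly slower detection.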
Module 8: Disaster Recovery and Business Continuity
- Classifying systems by recovery time and point objectives to allocate appropriate DR resources
- Testing full failover to secondary sites with real traffic redirection, not just connectivity checks
- Validating data consistency and integrity after failback from a disaster recovery site
- Storing offline backups in geographically isolated locations with physical access controls
- Documenting manual recovery procedures for systems that cannot be fully automated
- Coordinating DR testing with business units to minimize disruption to live operations
- Maintaining up-to-date contact lists and access credentials for emergency responders
- Reviewing insurance coverage and regulatory obligations related to prolonged outages
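The first bullet above, classifying systems by RTO and RPO, is often implemented as a simple tiering function so that replication, backup cadence, and test frequency can be budgeted per tier rather than per system. The thresholds below are purely illustrative; real values must be set with the business.

```python
def dr_tier(rto_minutes: float, rpo_minutes: float) -> str:
    """Map recovery time / recovery point objectives to a DR tier.

    Tighter objectives demand costlier mechanisms, so a system must meet
    BOTH bounds to qualify for a tier. Threshold values are examples.
    """
    if rto_minutes <= 15 and rpo_minutes <= 5:
        return "tier-0"   # continuous replication, automated failover
    if rto_minutes <= 240 and rpo_minutes <= 60:
        return "tier-1"   # warm standby, frequent snapshots
    if rto_minutes <= 1440 and rpo_minutes <= 720:
        return "tier-2"   # restore from backup within a day
    return "tier-3"       # best effort / rebuild from documentation
```

Note the AND in each condition: a system that needs a 10-minute RTO but tolerates hours of data loss still lands in a lower tier than one needing both bounds tight, which is usually the intended cost behavior.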
Module 9: Governance, Compliance, and Continuous Improvement
- Conducting post-incident reviews with mandatory action item tracking and closure verification
- Aligning availability practices with regulatory requirements (e.g., GDPR, HIPAA, SOX)
- Auditing change logs and access controls to detect policy violations or unauthorized modifications
- Integrating availability KPIs into executive reporting and board-level risk assessments
- Updating runbooks and playbooks based on lessons learned from real incidents
- Revising SLOs and error budgets in response to changing business priorities
- Enforcing configuration standards through automated policy-as-code tools
- Rotating on-call responsibilities to prevent burnout while maintaining team readiness
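The policy-as-code bullet above can be illustrated with a toy enforcement check. Real policy engines (OPA/Rego, for example) express far richer rules; this sketch only shows the enforce-by-diff idea, with hypothetical setting names.

```python
def policy_violations(config: dict, policy: dict) -> list[str]:
    """Report every config setting that is missing or differs from policy.

    `policy` maps a setting name to the value it must have. Returning a
    list of human-readable violations (rather than a bool) supports both
    CI gating and audit reporting from the same check.
    """
    violations = []
    for key, required in policy.items():
        actual = config.get(key, "<missing>")
        if actual != required:
            violations.append(f"{key}: expected {required!r}, got {actual!r}")
    return violations
```

Run in CI, an empty result gates the change through; a non-empty result both blocks the deploy and feeds the audit trail described above.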