This curriculum covers the design, operation, and governance of highly available systems. Its scope is comparable to a multi-workshop reliability engineering program embedded in the operational lifecycle of an enterprise SRE or platform team.
Module 1: Defining and Measuring System Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on system criticality and business SLAs
- Implementing time-based vs. event-based measurement windows to align with operational reporting cycles
- Instrumenting systems to capture downtime start/end times with synchronized clocks across distributed components
- Deciding whether to include planned maintenance in availability calculations based on contractual obligations
- Handling edge cases such as partial outages or degraded performance in availability reporting
- Integrating telemetry from third-party services into internal availability dashboards with latency and reliability constraints
- Establishing ownership for data collection accuracy across infrastructure, application, and SRE teams
- Designing audit trails for availability data to support compliance and post-incident reviews
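The metrics in this module can be illustrated with a small calculation. The following is a minimal sketch, not a prescribed method: the function name `availability_report`, the `(start, end, planned)` tuple layout, and the convention of removing excluded planned maintenance from the measured window are all assumptions made for illustration. It also assumes outage intervals do not overlap and that timestamps come from synchronized clocks.

```python
from datetime import datetime, timedelta


def availability_report(window_start, window_end, outages, count_planned=True):
    """Compute availability %, MTTR, and MTBF over a time-based window.

    outages: iterable of (start, end, planned) tuples, assumed non-overlapping.
    count_planned: if False, planned maintenance is removed from both the
    downtime and the measured window (one possible contractual convention).
    If True, planned windows are treated like any other outage.
    """
    total = (window_end - window_start).total_seconds()
    downtime = 0.0
    failures = 0
    for start, end, planned in outages:
        # Clip each outage to the reporting window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        seconds = (end - start).total_seconds()
        if planned and not count_planned:
            total -= seconds
            continue
        downtime += seconds
        failures += 1
    if total <= 0:
        raise ValueError("measured window is empty")
    uptime = total - downtime
    return {
        "availability_pct": 100.0 * uptime / total,
        "mttr_seconds": downtime / failures if failures else 0.0,
        "mtbf_seconds": uptime / failures if failures else None,
    }
```

For example, a single unplanned outage of 43 minutes 12 seconds in a 30-day window corresponds to roughly 99.9% availability.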
Module 2: Architecture for High Availability
- Choosing between active-passive and active-active failover models based on data consistency and recovery time requirements
- Implementing multi-region deployment strategies with DNS failover or global load balancers
- Designing stateless services to enable horizontal scaling and seamless instance replacement
- Managing shared state across regions using distributed databases with tunable consistency models
- Validating failover automation through controlled chaos engineering experiments
- Allocating capacity buffers in secondary regions to handle traffic spikes during failover
- Configuring health checks that accurately reflect service readiness without introducing false positives
- Documenting recovery topology dependencies to prevent cascading failures during outages
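One common way to keep health checks from flapping on a single bad probe is to require several consecutive contradicting results before changing state. The sketch below illustrates that idea only; the class name `HealthTracker` and the default thresholds are hypothetical, and real load balancers or orchestrators expose their own equivalents.

```python
class HealthTracker:
    """Flip health state only after N consecutive contradicting probes."""

    def __init__(self, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold  # failures needed to mark unhealthy
        self.pass_threshold = pass_threshold  # successes needed to mark healthy
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # Probe agrees with the current state, so reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.fail_threshold if self.healthy else self.pass_threshold
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

With these defaults, an isolated failed probe leaves the instance in rotation, while three failures in a row take it out until two consecutive probes succeed.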
Module 3: Redundancy and Failover Planning
- Selecting redundancy levels (N+1, 2N, etc.) based on risk tolerance and cost-benefit analysis
- Implementing automated failover triggers with configurable thresholds and escalation policies
- Testing failover procedures without disrupting live traffic using shadow routing or canary environments
- Managing failback processes with data resynchronization and consistency validation steps
- Coordinating failover execution across teams during cross-domain outages (e.g., network, storage, compute)
- Handling split-brain scenarios in distributed systems with quorum-based decision making
- Documenting manual override procedures for automated failover systems during edge-case failures
- Integrating failover status into incident management workflows and communication channels
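The quorum rule for avoiding split-brain can be sketched in a few lines: a node may act as primary only if it can reach a strict majority of the cluster, counting itself. The function names and the string node identifiers below are illustrative assumptions, and the sketch ignores details such as fencing and leader leases that production systems usually need.

```python
def has_quorum(reachable_votes: int, cluster_size: int) -> bool:
    """True if the reachable votes form a strict majority of the cluster."""
    return reachable_votes > cluster_size // 2


def may_act_as_primary(node_id, reachable_peers, cluster_members) -> bool:
    """Decide whether this node's partition is allowed to accept writes."""
    members = set(cluster_members)
    # Count this node plus any reachable peers that are known cluster members.
    votes = 1 + len(set(reachable_peers) & (members - {node_id}))
    return has_quorum(votes, len(members))
```

In a five-node cluster split three to two, only the three-node side passes this check. In a four-node cluster split evenly, neither side does, which is one reason odd cluster sizes are often preferred.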
Module 4: Monitoring and Alerting for Availability
- Designing synthetic transaction monitors that simulate critical user workflows end-to-end
- Setting alert thresholds that balance sensitivity with operational noise reduction
- Correlating alerts across layers (infrastructure, application, network) to identify root causes faster
- Implementing alert muting and routing policies during planned maintenance windows
- Validating monitoring coverage for all critical paths in complex microservices architectures
- Ensuring monitoring systems themselves are highly available and self-monitoring
- Integrating third-party API health into internal alerting with fallback detection mechanisms
- Archiving alert history for trend analysis and regulatory compliance
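Muting and routing during maintenance windows can be expressed as a simple lookup. This is a minimal sketch under stated assumptions: the `MaintenanceWindow` type, the `route_alert` function, and the three outcomes (`"page"`, `"ticket"`, `"muted"`) are hypothetical names, not the API of any particular alerting tool. Muted alerts would still need to be archived to preserve alert history.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    service: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def route_alert(service, severity, fired_at, windows):
    """Return 'muted', 'page', or 'ticket' for a single alert."""
    for window in windows:
        if window.service == service and window.start <= fired_at < window.end:
            # The alert fired inside a planned window for this service.
            return "muted"
    # Outside maintenance: page on critical alerts, open a ticket otherwise.
    return "page" if severity == "critical" else "ticket"
```

Scoping each window to a single service keeps a database maintenance window from hiding an unrelated API alert.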
Module 5: Incident Response and Recovery
- Defining escalation paths with clear role assignments during availability incidents
- Executing predefined runbooks while adapting to novel failure modes not covered in documentation
- Coordinating communication between technical teams, management, and external stakeholders during outages
- Deciding when to roll back changes versus pursuing remediation in production
- Preserving system state and logs during recovery for forensic analysis
- Managing access controls to production systems during emergency response to prevent unauthorized changes
- Conducting real-time impact assessment to prioritize recovery efforts based on business criticality
- Integrating external vendor support into incident workflows with defined SLAs and contact protocols
Module 6: Change and Maintenance Management
- Scheduling maintenance windows to minimize impact on peak business operations across time zones
- Implementing change advisory board (CAB) processes with risk-based approval tiers
- Requiring pre-change health checks and post-change validation in deployment pipelines
- Managing dependencies between services during coordinated upgrades
- Handling emergency changes with accelerated approval while maintaining auditability
- Enforcing deployment freeze periods during critical business events (e.g., Black Friday, fiscal close)
- Tracking rollback success rates to identify systemic deployment reliability issues
- Integrating change data into availability reports to correlate outages with recent modifications
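Correlating outages with recent changes can start with a time-window join between the change log and the outage log. The sketch below assumes a hypothetical `(change_id, deployed_at)` record shape and a two-hour lookback. Both are illustrative choices, and proximity in time is only a lead for investigation, not proof of cause.

```python
from datetime import datetime, timedelta


def changes_preceding(outage_start, changes, lookback=timedelta(hours=2)):
    """Return IDs of changes deployed within `lookback` before an outage began.

    changes: iterable of (change_id, deployed_at) tuples.
    """
    earliest = outage_start - lookback
    return [
        change_id
        for change_id, deployed_at in changes
        if earliest <= deployed_at <= outage_start
    ]
```

Running this for every outage in a reporting period gives a rough count of outages that followed a recent change, which can feed the availability report alongside rollback success rates.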
Module 7: Capacity and Performance Management
- Forecasting capacity needs based on historical growth trends and upcoming business initiatives
- Setting performance baselines for key transactions to detect degradation before failure
- Implementing auto-scaling policies with cooldown periods to prevent thrashing
- Conducting load testing under realistic traffic patterns to validate scaling behavior
- Managing resource contention in shared environments (e.g., Kubernetes clusters, VM hosts)
- Planning for burst capacity during seasonal peaks with spot or preemptible instances
- Monitoring queue depths and thread pools to detect impending resource exhaustion
- Right-sizing instance types based on actual utilization versus provisioned capacity
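The role of a cooldown period in preventing thrashing can be shown with a toy scaling policy. This is a sketch under assumptions: the `Autoscaler` class, the utilization thresholds, and the one-instance step size are hypothetical, and managed autoscalers implement richer policies.

```python
class Autoscaler:
    """Threshold-based scaler that ignores signals during a cooldown period."""

    def __init__(self, min_n, max_n, high=0.75, low=0.30, cooldown_s=300):
        self.min_n = min_n
        self.max_n = max_n
        self.high = high            # scale out above this utilization
        self.low = low              # scale in below this utilization
        self.cooldown_s = cooldown_s
        self.last_change_s = None   # time of the most recent scaling action

    def decide(self, now_s, current, utilization):
        """Return the desired instance count at time `now_s` (seconds)."""
        in_cooldown = (
            self.last_change_s is not None
            and now_s - self.last_change_s < self.cooldown_s
        )
        if in_cooldown:
            return current
        if utilization > self.high:
            target = min(current + 1, self.max_n)
        elif utilization < self.low:
            target = max(current - 1, self.min_n)
        else:
            target = current
        if target != current:
            self.last_change_s = now_s
        return target
```

Without the cooldown, a brief dip in utilization right after a scale-out would immediately trigger a scale-in, and the instance count would oscillate.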
Module 8: Dependency and Supply Chain Risk
- Mapping third-party service dependencies and assessing their availability commitments
- Implementing circuit breakers and fallback mechanisms for external API dependencies
- Validating failover capabilities for cloud provider regions with geographic risk exposure
- Assessing vendor lock-in implications when designing for multi-cloud availability
- Requiring SLAs and penalties from critical vendors with measurable enforcement mechanisms
- Monitoring upstream provider status pages and integrating alerts into internal systems
- Conducting business impact analysis for single points of failure in the supply chain
- Storing critical vendor credentials and support contracts in secure, accessible locations
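The circuit breaker pattern for external dependencies can be sketched as follows. This is an illustrative, single-threaded version: the `CircuitBreaker` class, its parameters, and the injectable `clock` are assumptions for the example, and it treats any exception from the wrapped call as a failure.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        """Invoke func() unless the circuit is open; use fallback() on failure."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A typical fallback returns cached or default data, so callers keep working in a degraded mode while the external API recovers.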
Module 9: Governance and Continuous Improvement
- Establishing availability targets (SLOs) with business units based on revenue and reputation impact
- Conducting blameless postmortems with actionable follow-up items and ownership assignments
- Tracking reliability debt alongside technical debt in portfolio planning
- Reviewing availability reports quarterly with executive stakeholders to adjust priorities
- Aligning availability investments with risk appetite defined in enterprise risk management
- Updating runbooks and documentation after every incident to reflect real-world conditions
- Integrating availability metrics into vendor performance evaluations and contract renewals
- Rotating on-call responsibilities to maintain team resilience and knowledge distribution
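Availability targets are often tracked as an error budget: the downtime an SLO permits over a window, minus the downtime already incurred. The following is a minimal sketch of that arithmetic; the function name `error_budget` and its return fields are hypothetical, and it measures the budget in minutes of full downtime only.

```python
def error_budget(slo_pct, window_days, downtime_minutes):
    """Return the allowed, remaining, and consumed error budget for a window."""
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    # Budget is the fraction of the window the SLO allows to be unavailable.
    budget = window_minutes * (1 - slo_pct / 100)
    return {
        "budget_minutes": budget,
        "remaining_minutes": budget - downtime_minutes,
        "consumed_fraction": downtime_minutes / budget,
    }
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a 21.6-minute outage consumes about half the budget. A negative remaining value signals that the target was missed for the window.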