Description

This curriculum covers the design, operation, and governance of highly available systems. Its scope is comparable to a multi-workshop reliability engineering program run within an enterprise SRE or platform team’s operational lifecycle.

Module 1: Defining and Measuring System Availability

Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on system criticality and business SLAs
Implementing time-based vs. event-based measurement windows to align with operational reporting cycles
Instrumenting systems to capture downtime start/end times with synchronized clocks across distributed components
Deciding whether to include planned maintenance in availability calculations based on contractual obligations
Handling edge cases such as partial outages or degraded performance in availability reporting
Integrating telemetry from third-party services into internal availability dashboards while accounting for the latency and reliability of those external feeds
Establishing ownership for data collection accuracy across infrastructure, application, and SRE teams
Designing audit trails for availability data to support compliance and post-incident reviews
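
As a rough illustration of the measurement topics above, the sketch below computes availability over a reporting window from recorded downtime intervals, with a switch for whether planned maintenance counts as downtime. The function names and the (start, end, planned) record shape are assumptions made for this example. It also assumes outage intervals do not overlap and that timestamps come from synchronized clocks.

```python
from datetime import datetime


def availability(window_start, window_end, outages, include_planned=True):
    """Return availability as a fraction of the reporting window.

    outages: list of (start, end, planned) tuples (hypothetical record shape).
    Intervals are assumed not to overlap; overlaps would be double counted.
    """
    window_s = (window_end - window_start).total_seconds()
    downtime_s = 0.0
    for start, end, planned in outages:
        if planned and not include_planned:
            continue  # contract excludes planned maintenance from downtime
        # Clip each outage to the reporting window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            downtime_s += (clipped_end - clipped_start).total_seconds()
    return 1.0 - downtime_s / window_s


def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability implied by MTBF and MTTR: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For a 30-day window, a single unplanned outage of 43 minutes 12 seconds gives an availability of 0.999 when planned maintenance is excluded. Some contracts also remove planned maintenance from the window itself, which this sketch does not do.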

Module 2: Architecture for High Availability

Choosing between active-passive and active-active failover models based on data consistency and recovery time requirements
Implementing multi-region deployment strategies with DNS failover or global load balancers
Designing stateless services to enable horizontal scaling and seamless instance replacement
Managing shared state across regions using distributed databases with tunable consistency models
Validating failover automation through controlled chaos engineering experiments
Allocating capacity buffers in secondary regions to handle traffic spikes during failover
Configuring health checks that accurately reflect service readiness without introducing false positives
Documenting recovery topology dependencies to prevent cascading failures during outages
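
One common way to keep health checks from flapping on a single bad probe is to require several consecutive results before changing state. The sketch below is a minimal illustration of that idea. The class name, thresholds, and defaults are assumptions for this example, not the behavior of any particular load balancer.

```python
class HealthState:
    """Tracks instance health with consecutive-result thresholds."""

    def __init__(self, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold  # failures needed to mark unhealthy
        self.pass_threshold = pass_threshold  # successes needed to mark healthy
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def observe(self, probe_ok):
        """Record one probe result and return the current health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state; reset the streak
        else:
            self._streak += 1
            limit = self.pass_threshold if probe_ok else self.fail_threshold
            if self._streak >= limit:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy
```

With the defaults, an isolated failed probe leaves the instance in rotation, while three failures in a row remove it. Higher thresholds reduce false positives at the cost of slower detection.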

Module 3: Redundancy and Failover Planning

Selecting redundancy levels (N+1, 2N, etc.) based on risk tolerance and cost-benefit analysis
Implementing automated failover triggers with configurable thresholds and escalation policies
Testing failover procedures without disrupting live traffic using shadow routing or canary environments
Managing failback processes with data resynchronization and consistency validation steps
Coordinating failover execution across teams during cross-domain outages (e.g., network, storage, compute)
Handling split-brain scenarios in distributed systems with quorum-based decision making
Documenting manual override procedures for automated failover systems during edge-case failures
Integrating failover status into incident management workflows and communication channels
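
The quorum rule behind split-brain handling can be stated in a few lines: a partition may act as primary only if it can reach a strict majority of the cluster's voting members. The sketch below shows that check. The function name is an assumption for this example, and real systems add leases, fencing, and voter weights on top of it.

```python
def has_quorum(reachable_votes, cluster_size):
    """Return True if this partition holds a strict majority of votes."""
    if cluster_size <= 0 or not 0 <= reachable_votes <= cluster_size:
        raise ValueError("invalid vote counts")
    return reachable_votes > cluster_size // 2


# A five-node cluster split 3/2: only the three-node side may continue.
majority_side = has_quorum(3, 5)   # True
minority_side = has_quorum(2, 5)   # False
# A four-node cluster split 2/2: neither side has quorum.
even_split = has_quorum(2, 4)      # False
```

The even-split case is why clusters are usually sized with an odd number of voters or given a tiebreaker.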

Module 4: Monitoring and Alerting for Availability

Designing synthetic transaction monitors that simulate critical user workflows end-to-end
Setting alert thresholds that balance sensitivity with operational noise reduction
Correlating alerts across layers (infrastructure, application, network) to identify root causes faster
Implementing alert muting and routing policies during planned maintenance windows
Validating monitoring coverage for all critical paths in complex microservices architectures
Ensuring monitoring systems themselves are highly available and self-monitoring
Integrating third-party API health into internal alerting with fallback detection mechanisms
Archiving alert history for trend analysis and regulatory compliance
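
Muting alerts during planned maintenance can be sketched as a lookup against a list of declared windows. The record shape (service, start, end) and the function name are assumptions for this example. A muted alert should still be recorded so that the alert history remains complete.

```python
from datetime import datetime


def is_muted(alert_time, service, windows):
    """Return True if the alert falls inside a maintenance window for its service.

    windows: list of (service, start, end) tuples (hypothetical record shape).
    The end of a window is exclusive.
    """
    return any(
        svc == service and start <= alert_time < end
        for svc, start, end in windows
    )
```

Scoping each window to a named service keeps maintenance on one system from hiding a real outage in another.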

Module 5: Incident Response and Recovery

Defining escalation paths with clear role assignments during availability incidents
Executing predefined runbooks while adapting to novel failure modes not covered in documentation
Coordinating communication between technical teams, management, and external stakeholders during outages
Deciding when to roll back changes versus pursuing remediation in production
Preserving system state and logs during recovery for forensic analysis
Managing access controls to production systems during emergency response to prevent unauthorized changes
Conducting real-time impact assessment to prioritize recovery efforts based on business criticality
Integrating external vendor support into incident workflows with defined SLAs and contact protocols

Module 6: Change and Maintenance Management

Scheduling maintenance windows to minimize impact on peak business operations across time zones
Implementing change advisory board (CAB) processes with risk-based approval tiers
Requiring pre-change health checks and post-change validation in deployment pipelines
Managing dependencies between services during coordinated upgrades
Handling emergency changes with accelerated approval while maintaining auditability
Enforcing deployment freeze periods during critical business events (e.g., Black Friday, fiscal close)
Tracking rollback success rates to identify systemic deployment reliability issues
Integrating change data into availability reports to correlate outages with recent modifications
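
Correlating outages with recent modifications can begin with a simple time-window join between the change log and outage start times. The sketch below lists the changes deployed shortly before an outage, most recent first. The (change_id, deployed_at) record shape and the two-hour default lookback are assumptions for this example.

```python
from datetime import datetime, timedelta


def changes_before(outage_start, changes, lookback=timedelta(hours=2)):
    """Return changes deployed within `lookback` before the outage, newest first.

    changes: list of (change_id, deployed_at) tuples (hypothetical record shape).
    """
    window_start = outage_start - lookback
    hits = [
        (change_id, deployed_at)
        for change_id, deployed_at in changes
        if window_start <= deployed_at <= outage_start
    ]
    return sorted(hits, key=lambda change: change[1], reverse=True)
```

Proximity in time only suggests candidates for review. It does not establish that a change caused the outage.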

Module 7: Capacity and Performance Management

Forecasting capacity needs based on historical growth trends and upcoming business initiatives
Setting performance baselines for key transactions to detect degradation before failure
Implementing auto-scaling policies with cooldown periods to prevent thrashing
Conducting load testing under realistic traffic patterns to validate scaling behavior
Managing resource contention in shared environments (e.g., Kubernetes clusters, VM hosts)
Planning for burst capacity during seasonal peaks with spot or preemptible instances
Monitoring queue depths and thread pools to detect impending resource exhaustion
Right-sizing instance types based on actual utilization versus provisioned capacity
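
The effect of a cooldown period on auto-scaling can be seen in a small simulation. The sketch below scales by one replica at a time and ignores further signals until the cooldown has elapsed. The class, thresholds, and step size are assumptions for this example and do not mirror any specific cloud provider's policy engine.

```python
class Autoscaler:
    """Threshold-based scaler that changes replica count at most once per cooldown."""

    def __init__(self, min_n, max_n, high=0.75, low=0.30, cooldown_s=300):
        self.min_n = min_n
        self.max_n = max_n
        self.high = high            # scale out above this utilization
        self.low = low              # scale in below this utilization
        self.cooldown_s = cooldown_s
        self.replicas = min_n
        self._last_change = None    # time of the last scaling action, in seconds

    def decide(self, now_s, utilization):
        """Return the replica count after considering one utilization sample."""
        in_cooldown = (
            self._last_change is not None
            and now_s - self._last_change < self.cooldown_s
        )
        if in_cooldown:
            return self.replicas  # suppress changes to prevent thrashing
        target = self.replicas
        if utilization > self.high:
            target = min(self.max_n, self.replicas + 1)
        elif utilization < self.low:
            target = max(self.min_n, self.replicas - 1)
        if target != self.replicas:
            self.replicas = target
            self._last_change = now_s
        return self.replicas
```

The gap between the high and low thresholds acts as a second guard against oscillation, since utilization between them triggers no change.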

Module 8: Dependency and Supply Chain Risk

Mapping third-party service dependencies and assessing their availability commitments
Implementing circuit breakers and fallback mechanisms for external API dependencies
Validating failover capabilities for cloud provider regions with geographic risk exposure
Assessing vendor lock-in implications when designing for multi-cloud availability
Requiring SLAs with penalty clauses and measurable enforcement mechanisms from critical vendors
Monitoring upstream provider status pages and integrating alerts into internal systems
Conducting business impact analysis for single points of failure in the supply chain
Storing critical vendor credentials and support contracts in secure, accessible locations
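
A circuit breaker with a fallback, as mentioned above for external API dependencies, can be sketched as follows. After a configurable number of consecutive failures the breaker opens and returns the fallback without calling the dependency. Once a reset interval has passed it lets one trial call through. The class name, parameters, and injectable clock are assumptions for this example, and the sketch is not thread-safe.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed, open, and half-open via a reset timer."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        """Call func unless the circuit is open; use fallback on failure."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()    # open: skip the dependency entirely
            # Reset interval elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A typical fallback returns cached or default data, so callers degrade gracefully instead of waiting on a failing vendor API.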

Module 9: Governance and Continuous Improvement

Establishing availability targets (SLOs) with business units based on revenue and reputation impact
Conducting blameless postmortems with actionable follow-up items and ownership assignments
Tracking reliability debt alongside technical debt in portfolio planning
Reviewing availability reports quarterly with executive stakeholders to adjust priorities
Aligning availability investments with risk appetite defined in enterprise risk management
Updating runbooks and documentation after every incident to reflect real-world conditions
Integrating availability metrics into vendor performance evaluations and contract renewals
Rotating on-call responsibilities to maintain team resilience and knowledge distribution
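
An availability SLO is often translated into an error budget, meaning the amount of downtime the target permits over a window. The error budget framing is a common SRE practice rather than something this outline specifies, and the function names below are assumptions for this example.

```python
def error_budget_minutes(slo, window_days=30):
    """Return allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget
```

A 99.9% target over 30 days allows roughly 43.2 minutes of downtime. A quarterly review might compare the remaining budget across services when adjusting priorities.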