This curriculum covers the technical, operational, and organizational practices taught in multi-workshop reliability engineering programs, at the depth of decision-making and trade-off analysis used in enterprise incident management, capacity planning, and risk-informed architecture reviews.
Module 1: Defining Availability Requirements and SLIs
- Selecting appropriate Service Level Indicators (SLIs) such as request latency, error rate, or throughput based on user impact and system architecture
- Negotiating Service Level Objectives (SLOs) with business stakeholders when system capabilities fall short of desired availability targets
- Differentiating between user-facing and backend service availability requirements in distributed systems
- Implementing synthetic monitoring to simulate user journeys and measure end-to-end availability
- Defining error budgets that balance innovation velocity with system stability
- Handling conflicting availability requirements across geographies due to regional compliance or infrastructure limitations
- Deciding when to exclude specific outages (e.g., planned maintenance) from SLO calculations
- Instrumenting edge cases such as partial failures in multi-region deployments
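
As a rough illustration of the error-budget topic in this module, the sketch below derives a request-based budget from an availability SLO. The function names and the 99.9% target are illustrative assumptions, not part of the curriculum.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    return 1.0 - failed_requests / error_budget(slo, total_requests)
```

With a 99.9% SLO over one million requests, about 1,000 failures are allowed, so 250 failures leave roughly 75% of the budget unspent.
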
Module 2: Architecting for High Availability
- Choosing between active-active and active-passive configurations based on data consistency and failover recovery needs
- Designing stateless services to enable horizontal scaling and reduce single points of failure
- Implementing retry logic with exponential backoff and jitter in service-to-service communication
- Introducing circuit breakers to prevent cascading failures during downstream service degradation
- Deciding on data replication strategies (synchronous vs. asynchronous) across availability zones
- Configuring load balancer health checks to accurately reflect backend service readiness
- Selecting appropriate quorum models in consensus-based systems like etcd or ZooKeeper
- Validating failover automation through controlled chaos engineering experiments
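
The retry topic in this module can be sketched as follows. This is a minimal example of exponential backoff with "full jitter" (each delay is drawn uniformly between zero and a capped exponential bound). The helper name `call_with_retry`, its defaults, and the choice of retryable exceptions are assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn, retrying on the listed exceptions with capped, jittered backoff."""
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error to the caller
            # Upper bound doubles per attempt and is capped; jitter spreads
            # out clients that would otherwise retry in lockstep.
            bound = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0.0, bound))
    raise AssertionError("unreachable")
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting. Only errors known to be transient should be listed in `retry_on`, and the wrapped operation should be idempotent.
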
Module 3: Monitoring and Observability Implementation
- Defining critical path metrics that directly correlate with user-visible availability
- Configuring alerting thresholds to minimize false positives while ensuring timely incident detection
- Correlating logs, metrics, and traces to identify root causes during partial outages
- Reducing cardinality in time-series databases to maintain query performance at scale
- Implementing structured logging with consistent schema across microservices
- Setting up distributed tracing with context propagation across service boundaries
- Managing retention policies for observability data based on legal, debugging, and cost constraints
- Integrating third-party SaaS monitoring tools without introducing new failure points
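
To make the structured-logging topic concrete, here is a small sketch built on Python's standard `logging` module. The field names (`ts`, `service`, `trace_id`, and so on) are an assumed schema chosen for illustration; a real schema would be agreed across teams.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    converter = time.gmtime  # emit timestamps in UTC rather than local time

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is optional; callers attach it via logging's `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)
```

A handler configured with `handler.setFormatter(JsonFormatter(service="checkout-api"))` would then emit one JSON line per event, and `logger.error("payment failed", extra={"trace_id": "abc123"})` would carry the trace identifier into the same schema.
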
Module 4: Incident Response and Outage Management
- Establishing on-call rotations with escalation paths and clear ownership per service
- Documenting runbooks that include diagnostic steps, known failure modes, and rollback procedures
- Conducting blameless postmortems to capture systemic issues rather than individual errors
- Implementing incident communication protocols for internal teams and external customers
- Automating common remediation actions such as pod restarts or traffic rerouting
- Managing alert fatigue by consolidating related signals and tuning notification channels
- Using incident timelines to reconstruct the sequence of events during complex failures
- Integrating incident data into training programs to improve team preparedness
Module 5: Capacity Planning and Performance Scaling
- Forecasting resource demand using historical traffic patterns and business growth projections
- Right-sizing compute instances to balance cost, performance, and availability
- Implementing autoscaling policies based on custom metrics beyond CPU and memory
- Simulating traffic spikes using load testing to validate scaling behavior
- Managing cold start issues in serverless environments during sudden traffic surges
- Planning for regional capacity limits in cloud provider environments
- Allocating buffer capacity to absorb unexpected load without triggering failover
- Coordinating scaling strategies across interdependent services to prevent bottlenecks
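
The buffer-capacity topic can be illustrated with a simple sizing calculation. The sketch below assumes that per-instance throughput is known from load testing and that headroom is expressed as a fraction of forecast peak. The function name and default values are hypothetical.

```python
import math


def required_instances(
    peak_rps: float,
    rps_per_instance: float,
    headroom: float = 0.3,
    min_instances: int = 2,
) -> int:
    """Instances needed to serve forecast peak traffic plus a safety buffer."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    target_rps = peak_rps * (1.0 + headroom)
    # Never drop below the floor, so a single instance failure is survivable.
    return max(min_instances, math.ceil(target_rps / rps_per_instance))
```

For a forecast peak of 1,200 requests per second, 100 requests per second per instance, and 25% headroom, this yields 15 instances.
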
Module 6: Disaster Recovery and Backup Strategies
- Defining Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for critical systems
- Testing full-stack disaster recovery procedures in isolated environments without production impact
- Automating backup validation to ensure data integrity and restore feasibility
- Encrypting backups and managing key lifecycle in alignment with security policies
- Storing backups in geographically separate regions to mitigate regional outages
- Orchestrating data resynchronization after failback to the primary region
- Handling dependencies on external systems during recovery when those systems are also down
- Documenting manual intervention steps required when automated recovery fails
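
One way to relate a backup schedule to the RPO covered in this module is to check the largest window in which data could be lost. The sketch below assumes backup completion times are available as timezone-aware `datetime` values. The function names are illustrative.

```python
from datetime import datetime, timedelta
from typing import List


def max_backup_gap(backup_times: List[datetime], now: datetime) -> timedelta:
    """Longest interval between consecutive backups, including the time since the last one."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: List[datetime], now: datetime, rpo: timedelta) -> bool:
    """True if no gap in the backup history exceeds the Recovery Point Objective."""
    return max_backup_gap(backup_times, now) <= rpo
```

This only checks the schedule. It says nothing about whether a given backup can actually be restored, which is what the backup-validation topic addresses.
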
Module 7: Change Management and Deployment Safety
- Implementing canary deployments with progressive traffic shifting based on health signals
- Using feature flags to decouple deployment from release and enable rapid rollback
- Enforcing deployment freeze windows during critical business periods
- Requiring pre-deployment checks such as performance benchmarks and SLO validation
- Tracking change velocity and correlating deployments with incident frequency
- Managing configuration drift between environments through infrastructure-as-code
- Coordinating cross-team deployments to prevent unintended interactions
- Automating rollback triggers based on error rate or latency thresholds
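
The rollback-trigger topic can be sketched as a pure decision function that compares canary metrics with the stable baseline. The thresholds, the `CanaryStats` type, and the choice of p99 latency are all assumptions for illustration. A real deployment pipeline would supply its own signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStats:
    requests: int
    errors: int
    p99_latency_ms: float


def should_rollback(
    canary: CanaryStats,
    baseline: CanaryStats,
    max_error_rate: float = 0.01,
    max_latency_ratio: float = 1.5,
    min_requests: int = 100,
) -> bool:
    """Decide whether the canary is unhealthy enough to roll back."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge; keep observing
    if canary.errors / canary.requests > max_error_rate:
        return True
    # Latency is judged relative to the baseline, not against a fixed number.
    return canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
```

Requiring a minimum request count avoids rolling back on a handful of unlucky requests, at the cost of a slower reaction during low-traffic periods.
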
Module 8: Governance, Compliance, and Risk Management
- Aligning availability controls with regulatory requirements such as HIPAA or GDPR
- Documenting risk acceptance decisions when technical constraints prevent meeting SLOs
- Conducting third-party audits of cloud provider Service Level Agreements (SLAs) and operational practices
- Managing access controls for production systems to reduce human error risk
- Reporting availability metrics to executive leadership and board-level risk committees
- Integrating availability risk into enterprise-wide risk assessment frameworks
- Ensuring business continuity plans include IT availability components
- Updating policies in response to postmortem findings and evolving threat models
Module 9: Cost-Availability Trade-offs and Optimization
- Evaluating the cost of redundancy (e.g., multi-region deployment) against the probability of a regional outage
- Right-sizing monitoring and logging retention to meet debugging needs without overspending
- Using spot instances or preemptible VMs for non-critical workloads with fallback mechanisms
- Quantifying the cost of downtime to justify investment in high-availability infrastructure
- Optimizing CDN usage to improve availability while minimizing egress fees
- Balancing investment in automation against operational toil reduction
- Negotiating custom SLAs with cloud providers based on actual usage patterns
- Implementing cost-aware autoscaling that considers both performance and spend
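
The redundancy and downtime-cost topics in this module come down to an expected-value comparison. The sketch below is a deliberately simplified model: it assumes outages are independent, that each one costs a fixed amount per hour, and that redundancy removes the outage entirely. The function names and inputs are hypothetical.

```python
def expected_annual_outage_cost(
    outages_per_year: float,
    hours_per_outage: float,
    cost_per_hour: float,
) -> float:
    """Expected yearly loss from outages under a simple frequency-times-impact model."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(
    redundancy_cost_per_year: float,
    outages_per_year: float,
    hours_per_outage: float,
    cost_per_hour: float,
) -> bool:
    """True if the expected loss avoided exceeds what the redundancy costs."""
    avoided = expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour)
    return avoided > redundancy_cost_per_year
```

For example, one regional outage every two years lasting four hours at 20,000 per hour gives an expected annual loss of 40,000, which would justify redundancy costing 30,000 per year but not 50,000.
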