Description

This curriculum covers the technical, operational, and organisational practices taught in multi-workshop reliability engineering programs. It works through decisions and trade-offs at the depth found in enterprise incident management, capacity planning, and risk-informed architecture reviews.

Module 1: Defining Availability Requirements and SLIs

Selecting appropriate Service Level Indicators (SLIs) such as request latency, error rate, or throughput based on user impact and system architecture
Negotiating SLOs with business stakeholders when system capabilities fall short of desired availability targets
Differentiating between user-facing and backend service availability requirements in distributed systems
Implementing synthetic monitoring to simulate user journeys and measure end-to-end availability
Defining error budgets that balance innovation velocity with system stability
Handling conflicting availability requirements across geographies due to regional compliance or infrastructure limitations
Deciding when to exclude specific outages (e.g., planned maintenance) from SLO calculations
Instrumenting edge cases such as partial failures in multi-region deployments
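
As a rough illustration of the error-budget and outage-exclusion topics above, the following Python sketch computes how much of a request-based error budget remains. The class name `ErrorBudget`, its fields, and the sample numbers are assumptions made for illustration and do not come from the curriculum. Real SLO tooling may define the budget differently, for example over time windows rather than request counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget (hypothetical helper, not a library API)."""

    slo_target: float           # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int         # requests observed in the SLO window
    failed_requests: int        # all failed requests in the window
    excluded_failures: int = 0  # failures agreed to be out of scope, e.g. planned maintenance

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def counted_failures(self) -> int:
        """Failures that count against the budget after exclusions."""
        return max(self.failed_requests - self.excluded_failures, 0)

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        if self.allowed_failures == 0:
            return 1.0 if self.counted_failures == 0 else float("-inf")
        return 1.0 - self.counted_failures / self.allowed_failures


# Illustrative numbers only: 1M requests, 600 failures, 100 of them during maintenance.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000,
                     failed_requests=600, excluded_failures=100)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

A team might pause risky releases once `remaining_fraction` drops below an agreed threshold; where that threshold sits is a policy decision rather than something the code can settle.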

Module 2: Architecting for High Availability

Choosing between active-active and active-passive configurations based on data consistency and failover recovery needs
Designing stateless services to enable horizontal scaling and reduce single points of failure
Implementing retry logic with exponential backoff and jitter in service-to-service communication
Introducing circuit breakers to prevent cascading failures during downstream service degradation
Deciding on data replication strategies (synchronous vs. asynchronous) across availability zones
Configuring load balancer health checks to accurately reflect backend service readiness
Selecting appropriate quorum models in consensus-based systems like etcd or ZooKeeper
Validating failover automation through controlled chaos engineering experiments
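
The retry item above can be sketched as follows: capped exponential backoff with "full jitter", where each wait is drawn uniformly between zero and the current backoff ceiling. The function name `call_with_retries`, its parameters, and the default values are assumptions for illustration, not a prescribed implementation. Which exceptions are safe to retry depends on the client library and on whether the operation is idempotent.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.1,   # seconds; ceiling for the first wait
    max_delay: float = 5.0,    # seconds; cap on any single wait
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn, retrying listed errors with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng.uniform(0.0, ceiling))  # full jitter: random wait up to the ceiling
    raise AssertionError("unreachable")


# Demo with a hypothetical downstream call that fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("downstream unavailable")
    return "ok"


waits: list = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, sleep=waits.append, rng=random.Random(42))
```

Injecting `sleep` and `rng` keeps the sketch testable without real delays. The jitter spreads retries from many clients over time so they do not hit a recovering service in lockstep.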

Module 3: Monitoring and Observability Implementation

Defining critical path metrics that directly correlate with user-visible availability
Configuring alerting thresholds to minimize false positives while ensuring timely incident detection
Correlating logs, metrics, and traces to identify root causes during partial outages
Reducing cardinality in time-series databases to maintain query performance at scale
Implementing structured logging with consistent schema across microservices
Setting up distributed tracing with context propagation across service boundaries
Managing retention policies for observability data based on legal, debugging, and cost constraints
Integrating third-party SaaS monitoring tools without introducing new failure points
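
To make the structured-logging and context-propagation topics above concrete, here is a minimal sketch using Python's standard `logging` and `json` modules. The schema (the field names `timestamp`, `level`, `service`, `message`, `trace_id`) and the `JsonFormatter` class are assumptions chosen for illustration. A real deployment would agree on its own schema and would usually take the trace ID from its tracing library rather than passing it by hand.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record with the same JSON keys (hypothetical schema)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id arrives via logging's `extra` argument; None if the caller omitted it
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same trace_id would be attached by every service handling this request.
logger.info("payment authorised", extra={"trace_id": "4bf92f35"})
```

Because every service emits the same keys, logs can be joined on `trace_id` with traces and metrics when investigating a partial outage.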

Module 4: Incident Response and Outage Management

Establishing on-call rotations with escalation paths and clear ownership per service
Documenting runbooks that include diagnostic steps, known failure modes, and rollback procedures
Conducting blameless postmortems to capture systemic issues rather than individual errors
Implementing incident communication protocols for internal teams and external customers
Automating common remediation actions such as pod restarts or traffic rerouting
Managing alert fatigue by consolidating related signals and tuning notification channels
Using incident timelines to reconstruct the sequence of events during complex failures
Integrating incident data into training programs to improve team preparedness

Module 5: Capacity Planning and Performance Scaling

Forecasting resource demand using historical traffic patterns and business growth projections
Right-sizing compute instances to balance cost, performance, and availability
Implementing autoscaling policies based on custom metrics beyond CPU and memory
Simulating traffic spikes using load testing to validate scaling behavior
Managing cold start issues in serverless environments during sudden traffic surges
Planning for regional capacity limits in cloud provider environments
Allocating buffer capacity to absorb unexpected load without triggering failover
Coordinating scaling strategies across interdependent services to prevent bottlenecks
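
The forecasting and buffer-capacity topics above reduce to simple arithmetic in the simplest case. The sketch below estimates an instance count from projected peak traffic, per-instance throughput, and a target utilisation that leaves headroom. The function name, parameters, and sample figures are hypothetical, and the model assumes throughput scales linearly with instance count, which load testing would need to confirm.

```python
import math


def required_instances(
    peak_rps: float,            # observed peak requests per second
    growth_rate: float,         # projected growth over the planning horizon, e.g. 0.25 for 25%
    per_instance_rps: float,    # sustainable throughput of one instance (from load tests)
    target_utilisation: float,  # fraction of capacity to use at peak; the rest is buffer
    min_instances: int = 2,     # floor so a single instance failure is survivable
) -> int:
    """Instances needed to serve projected peak load at the target utilisation."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    projected_peak = peak_rps * (1 + growth_rate)
    usable_per_instance = per_instance_rps * target_utilisation
    return max(min_instances, math.ceil(projected_peak / usable_per_instance))


# Illustrative numbers only: 1000 rps peak, 25% growth, 100 rps per instance, 50% utilisation.
needed = required_instances(1000, 0.25, 100, 0.5)
print(needed)  # 1250 / 50 = 25
```

Lowering `target_utilisation` buys more buffer at higher cost, which is the trade-off the buffer-capacity item asks participants to weigh.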

Module 6: Disaster Recovery and Backup Strategies

Defining Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for critical systems
Testing full-stack disaster recovery procedures in isolated environments without production impact
Automating backup validation to ensure data integrity and restore feasibility
Encrypting backups and managing key lifecycle in alignment with security policies
Storing backups in geographically separate regions to mitigate regional outages
Orchestrating data resynchronization after failback to the primary region
Handling dependencies on external systems during recovery when those systems are also down
Documenting manual intervention steps required when automated recovery fails
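
As a small illustration of how RPO and RTO targets can be checked mechanically, the sketch below compares the age of the last validated backup against an RPO and a measured restore duration against an RTO. The function names and the sample timestamps are assumptions for illustration. The sketch expects timezone-aware `datetime` values and says nothing about how backups are validated or restores are timed.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rpo_breached(last_good_backup: datetime, rpo: timedelta,
                 now: Optional[datetime] = None) -> bool:
    """True if a failure right now would lose more data than the RPO allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_good_backup > rpo


def rto_met(restore_started: datetime, service_restored: datetime,
            rto: timedelta) -> bool:
    """True if a recovery (real or drilled) completed within the RTO."""
    return service_restored - restore_started <= rto


# Hypothetical drill: last validated backup at 02:00 UTC, checked at 09:30 UTC.
last_backup = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
stale = rpo_breached(last_backup, rpo=timedelta(hours=6), now=check_time)

# Hypothetical restore that took 45 minutes against a 1-hour RTO.
restored_in_time = rto_met(check_time, check_time + timedelta(minutes=45),
                           rto=timedelta(hours=1))
```

Running checks like these on a schedule turns the objectives into alerts rather than figures that are only revisited during an actual disaster.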

Module 7: Change Management and Deployment Safety

Implementing canary deployments with progressive traffic shifting based on health signals
Using feature flags to decouple deployment from release and enable rapid rollback
Enforcing deployment freeze windows during critical business periods
Requiring pre-deployment checks such as performance benchmarks and SLO validation
Tracking change velocity and correlating deployments with incident frequency
Managing configuration drift between environments through infrastructure-as-code
Coordinating cross-team deployments to prevent unintended interactions
Automating rollback triggers based on error rate or latency thresholds
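
The automated rollback trigger described above can be sketched as a pure decision function over recent health samples from a canary. The names `HealthSample` and `should_roll_back`, the thresholds, and the "N consecutive breaches" rule are assumptions for illustration. Production systems often compare the canary against a baseline instead of using fixed thresholds.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HealthSample:
    """One observation window for the new version (hypothetical shape)."""

    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def should_roll_back(
    samples: Sequence[HealthSample],
    max_error_rate: float = 0.01,
    max_p99_latency_ms: float = 500.0,
    consecutive: int = 3,
) -> bool:
    """True if the most recent `consecutive` samples all breach a threshold.

    Requiring several breaches in a row avoids rolling back on a single noisy window.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < consecutive:
        return False  # not enough data yet to decide
    recent = samples[-consecutive:]
    return all(
        s.error_rate > max_error_rate or s.p99_latency_ms > max_p99_latency_ms
        for s in recent
    )


# Illustrative canary history: healthy at first, then three bad windows in a row.
history = [
    HealthSample(0.002, 180.0),
    HealthSample(0.030, 210.0),
    HealthSample(0.004, 650.0),
    HealthSample(0.050, 700.0),
]
decision = should_roll_back(history)
```

Keeping the decision separate from the action (shifting traffic back, disabling a feature flag) makes the trigger easy to test and audit.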

Module 8: Governance, Compliance, and Risk Management

Aligning availability controls with regulatory requirements such as HIPAA or GDPR
Documenting risk acceptance decisions when technical constraints prevent meeting SLOs
Conducting third-party audits of cloud provider SLAs and operational practices
Managing access controls for production systems to reduce human error risk
Reporting availability metrics to executive leadership and board-level risk committees
Integrating availability risk into enterprise-wide risk assessment frameworks
Ensuring business continuity plans include IT availability components
Updating policies in response to postmortem findings and evolving threat models

Module 9: Cost-Availability Trade-offs and Optimization

Evaluating the cost of redundancy (e.g., multi-region deployment) against the probability of a regional outage
Right-sizing monitoring and logging retention to meet debugging needs without overspending
Using spot instances or preemptible VMs for non-critical workloads with fallback mechanisms
Quantifying the cost of downtime to justify investment in high-availability infrastructure
Optimizing CDN usage to improve availability while minimizing egress fees
Balancing investment in automation against operational toil reduction
Negotiating custom SLAs with cloud providers based on actual usage patterns
Implementing cost-aware autoscaling that considers both performance and spend
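
The redundancy and downtime-cost topics above can be framed as an expected-value comparison. The sketch below is a deliberately simple model: expected annual loss is outage frequency times outage duration times cost per hour, and redundancy is worthwhile if the loss it avoids exceeds what it costs. All function names and figures are hypothetical, and the model ignores effects such as reputational damage and correlated failures.

```python
def expected_annual_loss(outages_per_year: float, hours_per_outage: float,
                         cost_per_hour: float) -> float:
    """Expected yearly downtime cost under a simple frequency x duration x cost model."""
    return outages_per_year * hours_per_outage * cost_per_hour


def net_benefit_of_redundancy(loss_without: float, loss_with: float,
                              annual_redundancy_cost: float) -> float:
    """Positive when the avoided loss exceeds the yearly cost of the redundancy."""
    return (loss_without - loss_with) - annual_redundancy_cost


# Illustrative numbers only.
# Single region: one regional outage every two years, 8 hours each, $50k per hour.
single_region = expected_annual_loss(0.5, 8.0, 50_000.0)
# Multi-region: same outage frequency, but failover limits impact to 30 minutes.
multi_region = expected_annual_loss(0.5, 0.5, 50_000.0)
net = net_benefit_of_redundancy(single_region, multi_region,
                                annual_redundancy_cost=120_000.0)
print(f"net annual benefit of multi-region: ${net:,.0f}")
```

The inputs are estimates, so it is usually worth rerunning the comparison across a range of outage frequencies and hourly costs to see how sensitive the conclusion is.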