Description

This curriculum covers the design, deployment, and operation of highly available systems. Its scope is comparable to a multi-workshop reliability engineering program run across large-scale cloud environments.

Module 1: Defining High Availability Requirements and SLIs

Selecting appropriate service level indicators (SLIs) such as request latency, error rate, and throughput based on user-facing impact
Negotiating SLOs with product and operations stakeholders to balance business needs with technical feasibility
Mapping user journeys to identify critical paths that require high availability protection
Determining acceptable downtime windows for non-critical subsystems during maintenance
Implementing synthetic monitoring to simulate user behavior for accurate SLI measurement
Configuring error budgets to guide release velocity and incident response priorities
Documenting recovery time objectives (RTO) and recovery point objectives (RPO) for each service tier
Establishing escalation thresholds based on SLO burn rate detection
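
To make the error budget and burn rate items above concrete, here is a minimal sketch of the arithmetic. The function names, the 30-day window, and the thresholds used in the usage note are assumptions chosen for illustration; the curriculum does not prescribe them.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Speed of budget consumption; 1.0 means the budget lasts exactly one window."""
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio


def should_escalate(observed_error_ratio: float, slo_target: float,
                    threshold: float) -> bool:
    """Escalate when the current burn rate reaches the (assumed) threshold."""
    return burn_rate(observed_error_ratio, slo_target) >= threshold
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of budget, and an observed error ratio of 2% against that target burns the budget about 20 times faster than planned, which would trip an escalation threshold of 14.4.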

Module 2: Infrastructure Design for Fault Tolerance

Distributing workloads across multiple availability zones to mitigate zone-level failures
Selecting instance types with appropriate burst capacity and redundancy for stateful services
Designing multi-region architectures for disaster recovery with data replication strategies
Implementing anti-affinity rules to prevent co-location of redundant instances
Configuring resilient storage backends using distributed file systems or managed databases
Validating failover mechanisms through controlled zone evacuation drills
Choosing between active-passive and active-active topologies based on cost and complexity trade-offs
Integrating hardware health checks with orchestration layers for automatic node replacement
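
The zone distribution and anti-affinity items above can be sketched as a small placement check. This is a simplified model, not how any particular orchestrator schedules workloads; `place_replicas` and `survives_zone_loss` are hypothetical helper names.

```python
from collections import Counter
from itertools import cycle


def place_replicas(replica_count: int, zones: list[str]) -> list[str]:
    """Assign replicas to zones round-robin so they spread as evenly as possible."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def survives_zone_loss(placement: list[str], min_healthy: int) -> bool:
    """True if losing the most heavily loaded zone still leaves min_healthy replicas."""
    per_zone = Counter(placement)
    worst_loss = max(per_zone.values(), default=0)
    return len(placement) - worst_loss >= min_healthy
```

Under this model, five replicas spread over three zones keep at least three replicas through any single-zone failure, while three replicas packed into one zone keep none.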

Module 3: Automated Deployment and Immutable Infrastructure

Enforcing immutable server patterns using container images or golden AMIs to reduce configuration drift
Implementing blue-green or canary deployments with traffic shifting via a service mesh or load balancer
Automating rollback triggers based on SLO violations during deployment windows
Validating deployment artifacts with static analysis and vulnerability scanning in the CI pipeline
Managing configuration secrets using encrypted parameter stores with least-privilege access
Synchronizing infrastructure changes across regions using declarative templates
Enforcing deployment freeze policies during high-risk periods using pipeline guards
Instrumenting deployment metadata to correlate incidents with recent changes
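
One way to picture the automated rollback trigger for a canary deployment is the decision function below. It is a sketch under stated assumptions: the `WindowStats` type, the minimum sample size, and the tolerance factor are all hypothetical, and a real pipeline would read these counts from its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts observed during the deployment window."""
    requests: int
    errors: int

    @property
    def error_ratio(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: WindowStats, canary: WindowStats,
                    slo_error_ratio: float, min_requests: int = 100,
                    tolerance: float = 2.0) -> str:
    """Return "wait", "rollback", or "promote" for the canary."""
    # Too little traffic to judge: keep the canary at its current weight.
    if canary.requests < min_requests:
        return "wait"
    # Hard limit: the canary violates the SLO on its own.
    if canary.error_ratio > slo_error_ratio:
        return "rollback"
    # Relative limit: the canary is clearly worse than the stable version.
    regressed = (baseline.error_ratio > 0
                 and canary.error_ratio > baseline.error_ratio * tolerance)
    return "rollback" if regressed else "promote"
```

The relative check catches a canary that stays inside the SLO but is still several times worse than the version it replaces.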

Module 4: Resilient Service Communication

Implementing circuit breakers and bulkheads in service clients to prevent cascading failures
Configuring retry budgets with exponential backoff and jitter for transient errors
Enforcing service-to-service authentication using mTLS in a service mesh
Setting timeout budgets that align with upstream and downstream SLOs
Routing traffic around degraded services using health-aware load balancing
Managing DNS TTL values to balance caching efficiency with failover responsiveness
Implementing graceful degradation of non-essential features during partial outages
Monitoring and alerting on increased latency or error rates in inter-service calls
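
The retry item above can be illustrated with exponential backoff and full jitter. Note that this sketch only caps attempts per call; a fleet-wide retry budget would need shared state that is out of scope here. `TransientError`, the base delay, and the cap are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation: Callable[[], T], max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Randomizing the delay spreads retries from many clients over time, so a recovering dependency is not hit by synchronized waves of requests.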

Module 5: Data Consistency and Replication

Selecting replication topology (synchronous vs asynchronous) based on RPO and latency tolerance
Configuring quorum-based consensus in distributed databases to balance write availability with consistency
Handling split-brain scenarios with automated fencing and leader election
Maintaining referential integrity across sharded databases during failover
Implementing change data capture for cross-region data synchronization
Validating data consistency using checksums and reconciliation jobs
Planning for backward and forward compatibility in schema migrations
Testing backup restoration procedures with point-in-time recovery
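
As an illustration of checksum-based consistency validation, the sketch below compares two in-memory copies of a table keyed by primary key. The `row_checksum` and `reconcile` helpers are hypothetical; a production reconciliation job would stream rows in batches from the actual databases rather than hold them in dictionaries.

```python
import hashlib
import json


def row_checksum(row: dict) -> str:
    """Hash a row in a canonical form so key order does not affect the result."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(primary: dict, replica: dict) -> dict:
    """Compare two {primary_key: row} mappings and report the differences."""
    missing = sorted(k for k in primary if k not in replica)
    extra = sorted(k for k in replica if k not in primary)
    mismatched = sorted(
        k for k in primary
        if k in replica and row_checksum(primary[k]) != row_checksum(replica[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

Rows reported as missing or mismatched are candidates for re-replication, while extra rows on the replica may point to deletes that were never propagated.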

Module 6: Monitoring, Alerting, and Observability

Defining actionable alerts based on symptoms (e.g., latency, errors) rather than causes
Reducing alert fatigue by grouping related signals and tuning thresholds to reflect user impact
Instrumenting distributed traces to diagnose latency bottlenecks across services
Correlating logs, metrics, and traces using unique request identifiers
Setting up golden signal dashboards for real-time service health visibility
Automating alert routing to on-call engineers using escalation policies
Validating monitoring coverage through fault injection and chaos engineering
Archiving telemetry data according to retention policies and compliance requirements
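
Correlating telemetry by request identifier can be sketched with the Python standard library: a context variable carries the identifier, and a logging filter stamps it on every record. The names `request_id_var`, `RequestIdFilter`, and `handle_request` are assumptions for illustration, and the two log calls stand in for real request handling.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the identifier of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request identifier to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def handle_request(logger: logging.Logger,
                   incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's identifier if present, otherwise mint a new one."""
    request_id = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        # Restore the previous value so identifiers never leak across requests.
        request_id_var.reset(token)
    return request_id
```

With the filter attached to a handler whose format string includes `%(request_id)s`, every line logged during a request carries the same identifier, which can then be matched against trace and metric labels.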

Module 7: Incident Response and Postmortem Culture

Activating incident response protocols with defined roles (incident commander, comms lead)
Using communication bridges and status pages to coordinate internal and external updates
Executing predefined runbooks for common failure scenarios
Preserving system state and logs during incidents for forensic analysis
Conducting blameless postmortems with root cause analysis and action item tracking
Integrating postmortem findings into reliability improvements and training materials
Testing incident response readiness through simulated outage drills
Managing executive and stakeholder communication during major incidents

Module 8: Capacity Planning and Load Management

Forecasting resource demand using historical growth trends and business projections
Implementing auto-scaling policies based on queue depth, CPU, or custom metrics
Setting scaling limits to prevent runaway costs during traffic spikes
Simulating traffic surges using load testing tools to validate scaling behavior
Implementing rate limiting and quota enforcement at API gateways
Using priority queuing to protect core functionality during overload
Pre-warming infrastructure ahead of scheduled high-traffic events
Monitoring for resource exhaustion in shared pools (e.g., database connections)
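
Rate limiting at a gateway is often explained with a token bucket, sketched below. This is a single-process illustration with an injectable clock; the `TokenBucket` class and its parameters are assumptions, and a real gateway would typically keep this state in a shared store and guard it against concurrent access.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity` while refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `rate=2.0` and `capacity=3.0` admits three requests back to back, rejects the fourth, and admits two more after one second has passed.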

Module 9: Security and Compliance in High Availability Systems

Integrating security patching into automated deployment pipelines without downtime
Enforcing encryption at rest and in transit across all data tiers
Implementing audit logging with immutable storage for compliance verification
Designing secure cross-region data transfer to meet data sovereignty requirements
Validating failover configurations against security policy enforcement points
Conducting penetration testing on disaster recovery environments
Managing access keys and certificates with automated rotation and revocation
Aligning backup retention schedules with regulatory data preservation mandates