This curriculum outlines a multi-workshop resilience engineering program with technical and operational rigor, addressing fault tolerance across distributed systems, infrastructure, deployment, monitoring, disaster recovery, and cross-team incident management in complex DevOps environments.
Module 1: Foundations of Fault Tolerance in Distributed Systems
- Selecting between active-active and active-passive deployment topologies based on recovery time objectives and infrastructure cost constraints.
- Implementing health checks that distinguish between transient and permanent failures to avoid cascading restarts.
- Designing retry mechanisms with exponential backoff and jitter to prevent thundering herd problems during service outages.
- Choosing appropriate circuit breaker thresholds (failure count, timeout duration) based on service response time SLAs.
- Integrating distributed tracing to correlate fault events across microservices and identify root causes during partial outages.
- Deciding on stateless versus stateful service design when evaluating recovery complexity after node failures.
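The retry pattern described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `flaky_call` is a stand-in for a real service call, and the delay parameters are arbitrary example values.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=5.0):
    """Retry a callable with capped exponential backoff and full jitter.

    Full jitter (a delay drawn uniformly from [0, capped exponential bound])
    de-synchronizes clients so they do not retry in lockstep after an outage.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            bound = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))

# Stand-in for a service call that fails twice, then recovers.
calls = {"count": 0}
def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry_with_backoff(flaky_call)
print(result, calls["count"])
```

Drawing the jittered delay from the full `[0, bound]` range, rather than adding a small random offset to a fixed delay, is what spreads a synchronized burst of retries across the whole window.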
Module 2: Resilient Infrastructure Design
- Configuring multi-AZ deployments in cloud environments with automated failover for critical database instances.
- Implementing immutable infrastructure patterns to eliminate configuration drift during node replacement.
- Selecting instance types and families that support live migration or host redundancy in virtualized environments.
- Designing network topologies with redundant routing paths and avoiding single points of failure in load balancer placement.
- Automating host-level recovery using watchdog processes that trigger instance replacement after kernel panics.
- Managing shared storage dependencies that introduce coupling during infrastructure failover scenarios.
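The watchdog-driven recovery and transient-vs-permanent distinction above can be sketched together. The hooks `check_health` and `replace_instance` are hypothetical callbacks into your monitoring and provisioning systems; the consecutive-failure threshold is what keeps a single blip from triggering a (potentially cascading) restart.

```python
import time

def watchdog(check_health, replace_instance, failure_threshold=3,
             interval=5.0, max_cycles=None):
    """Poll a health check and trigger replacement only after sustained failures.

    Requiring several consecutive failures distinguishes transient blips
    from permanent faults. check_health and replace_instance are
    caller-supplied hooks (illustrative, not a real API).
    """
    consecutive_failures = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        if check_health():
            consecutive_failures = 0  # transient failure streak broken
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                replace_instance()  # treat as permanent; replace the host
                consecutive_failures = 0
        time.sleep(interval)

# Simulated health signal: one transient blip, then three hard failures.
signal = iter([True, False, True, False, False, False])
replacements = []
watchdog(lambda: next(signal),
         lambda: replacements.append("replaced"),
         failure_threshold=3, interval=0.0, max_cycles=6)
print(replacements)
```

Note that the single transient failure in the middle of the signal resets cleanly and never triggers a replacement; only the sustained failure streak does.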
Module 3: Continuous Deployment with Zero Downtime
- Implementing blue-green deployments with traffic shifting via DNS or load balancer rules to minimize rollback risk.
- Using canary rollouts with automated metric validation to detect regressions before full release.
- Coordinating database schema migrations that maintain backward compatibility during dual-version runtime.
- Designing deployment pipelines that pause on threshold breaches in error rate or latency during rollout.
- Managing configuration drift between environments by enforcing infrastructure-as-code compliance in staging and production.
- Handling long-running transactions during deployment windows to avoid data inconsistency on rollback.
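The canary-with-metric-validation flow above can be sketched as a simple rollout gate. `get_error_rate` and `set_traffic_percent` are placeholder hooks into a metrics store and a load balancer, and the step sizes and threshold are illustrative assumptions.

```python
def run_canary(get_error_rate, set_traffic_percent,
               steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Shift traffic to the canary in steps, gating each step on live metrics.

    Returns True on full rollout; on a threshold breach it shifts traffic
    back to 0% (automated rollback) and returns False.
    """
    for pct in steps:
        set_traffic_percent(pct)
        if get_error_rate() > max_error_rate:
            set_traffic_percent(0)  # automated rollback on regression
            return False
    return True

# Simulated rollout: error rate regresses once the canary takes 50% of traffic.
rates = iter([0.001, 0.002, 0.05])
traffic_history = []
succeeded = run_canary(lambda: next(rates), traffic_history.append)
print(succeeded, traffic_history)
```

A production gate would also evaluate latency and watch for threshold breaches over a soak window rather than a single reading, per the pipeline-pause bullet above.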
Module 4: Automated Monitoring and Alerting
- Defining SLO-based error budgets to determine when to trigger alerts versus tolerate transient degradation.
- Configuring synthetic transactions to detect end-to-end service failures before user impact.
- Filtering alert noise by correlating related incidents and suppressing lower-tier alerts during cascading failures.
- Setting up log sampling strategies for high-volume services to balance observability and cost.
- Integrating alert routing with on-call schedules and escalation policies in incident management systems.
- Validating monitoring coverage during deployments by comparing pre- and post-release metric availability.
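The SLO-based error budget idea above reduces to simple arithmetic. This sketch assumes a request-based SLO (availability measured as the fraction of successful requests); the specific numbers are illustrative.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    A 99.9% SLO over N requests allows 0.1% of them to fail; alerting on
    budget burn rather than raw error counts lets the team tolerate
    transient degradation without paging.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return max(0.0, 1.0 - failed_requests / allowed_failures)

# 99.9% SLO over 1,000,000 requests allows ~1,000 failures;
# 250 observed failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(round(remaining, 4))
```

An alerting rule would then page only when the remaining budget falls fast (a high burn rate), not on every individual error spike.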
Module 5: Disaster Recovery and Backup Strategies
- Establishing RPO and RTO targets for each service tier and aligning backup frequency accordingly.
- Testing cross-region failover procedures with controlled DNS TTL reductions and traffic rerouting.
- Encrypting backups at rest and managing key rotation in alignment with compliance requirements.
- Validating backup integrity through periodic restore drills in isolated environments.
- Documenting manual recovery steps for systems that lack full automation due to legacy constraints.
- Managing dependencies on third-party services during disaster scenarios where external APIs may also be degraded.
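The backup-integrity drill above can be sketched with content digests: record a checksum at backup time, then verify it against the restored copy in an isolated environment. The file contents and paths here are simulated stand-ins for a real database dump and restore.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large backups need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original_digest, restored_path):
    """A restore drill passes only if the restored copy matches the recorded digest."""
    return sha256_of(restored_path) == original_digest

# Simulated drill: "back up" a file, record its digest, "restore" a copy, verify.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "db.dump")
    with open(src, "wb") as f:
        f.write(b"backup payload")
    digest = sha256_of(src)          # recorded at backup time
    restored = os.path.join(tmp, "restored.dump")
    with open(restored, "wb") as f:  # stand-in for an actual restore
        f.write(b"backup payload")
    print(verify_restore(digest, restored))
```

A digest match proves the bytes survived, but a full drill should still start the restored database and run application-level checks, since byte-identical files can hide logical corruption captured at backup time.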
Module 6: Chaos Engineering and Proactive Resilience Testing
- Scheduling chaos experiments during low-traffic windows to minimize business impact.
- Injecting network latency and packet loss at the container or host level to simulate regional outages.
- Automating rollback of chaos experiments if predefined safety thresholds are violated.
- Integrating chaos testing into CI/CD pipelines for stateful services with ephemeral environments.
- Measuring blast radius by isolating test scopes to specific service instances or availability zones.
- Documenting failure modes discovered during experiments to update runbooks and architecture diagrams.
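The automated-rollback-on-safety-breach pattern above can be sketched as a guarded experiment loop. `inject_fault`, `revert_fault`, and `get_p99_latency_ms` are illustrative hooks (e.g. adding network latency at the host and reading p99 from a metrics store); the threshold is an example value.

```python
def run_chaos_experiment(inject_fault, revert_fault, get_p99_latency_ms,
                         safety_threshold_ms=500, checks=5):
    """Run a fault injection, aborting early if a safety threshold is violated.

    Returns True if the experiment completed, False if it was aborted.
    The finally block guarantees the injected fault is reverted either way.
    """
    inject_fault()
    try:
        for _ in range(checks):
            if get_p99_latency_ms() > safety_threshold_ms:
                return False  # blast radius exceeded; abort the experiment
        return True
    finally:
        revert_fault()  # rollback always runs, even on abort or exception

# Simulated run: latency degrades past the threshold on the third check.
readings = iter([120, 340, 800])
events = []
ok = run_chaos_experiment(lambda: events.append("inject"),
                          lambda: events.append("revert"),
                          lambda: next(readings),
                          safety_threshold_ms=500, checks=5)
print(ok, events)
```

Putting the revert in a `finally` block is the important design choice: the experiment must not be able to exit, crash, or abort while leaving the injected fault in place.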
Module 7: Governance and Operational Trade-offs
- Approving exceptions to fault tolerance standards for low-risk internal tools based on cost-benefit analysis.
- Requiring resilience design reviews for new services that integrate with critical business workflows.
- Enforcing tagging and metadata standards to track ownership and recovery priority during incidents.
- Allocating budget for redundancy features based on service criticality and historical incident data.
- Translating incident postmortem findings into updated fault tolerance controls to prevent recurrence.
- Managing technical debt in legacy systems by prioritizing incremental resilience improvements over full rewrites.
Module 8: Cross-Team Coordination and Incident Response
- Establishing communication protocols for incident commanders during multi-team outages.
- Standardizing runbook formats to ensure clarity and actionability during high-stress events.
- Conducting blameless postmortems with participation from development, operations, and product teams.
- Integrating third-party vendor response SLAs into incident escalation paths for external dependencies.
- Rotating on-call responsibilities across team members to distribute cognitive load and build shared expertise.
- Simulating large-scale outages with tabletop exercises to validate coordination and decision-making under pressure.