This curriculum outlines a multi-workshop resilience engineering program with technical and operational rigor, addressing fault tolerance across distributed systems, infrastructure, deployment, monitoring, disaster recovery, and cross-team incident management in complex DevOps environments.
Module 1: Foundations of Fault Tolerance in Distributed Systems
- Selecting between active-active and active-passive deployment topologies based on recovery time objectives and infrastructure cost constraints.
- Implementing health checks that distinguish between transient and permanent failures to avoid cascading restarts.
- Designing retry mechanisms with exponential backoff and jitter to prevent thundering herd problems during service outages.
- Choosing appropriate circuit breaker thresholds (failure count, timeout duration) based on service response time SLAs.
- Integrating distributed tracing to correlate fault events across microservices and identify root causes during partial outages.
- Deciding on stateless versus stateful service design when evaluating recovery complexity after node failures.
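The retry pattern described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `flaky_call` is a stand-in for a real service call, and the delay parameters are arbitrary example values.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=5.0):
    """Retry a callable with capped exponential backoff and full jitter.

    Full jitter (a delay drawn uniformly from [0, capped exponential bound])
    de-synchronizes clients so they do not retry in lockstep after an outage.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            bound = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))

# Stand-in for a service call that fails twice, then recovers.
calls = {"count": 0}
def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry_with_backoff(flaky_call)
print(result, calls["count"])
```

Drawing the jittered delay from the full `[0, bound]` range, rather than adding a small random offset to a fixed delay, is what spreads a synchronized burst of retries across the whole window.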
Module 2: Resilient Infrastructure Design
- Configuring multi-AZ deployments in cloud environments with automated failover for critical database instances.
- Implementing immutable infrastructure patterns to eliminate configuration drift during node replacement.
- Selecting instance types and families that support live migration or host redundancy in virtualized environments.
- Designing network topologies with redundant routing paths and avoiding single points of failure in load balancer placement.
- Automating host-level recovery using watchdog processes that trigger instance replacement after kernel panics.
- Managing shared storage dependencies that introduce coupling during infrastructure failover scenarios.
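The watchdog-driven recovery and transient-vs-permanent distinction above can be sketched together. The hooks `check_health` and `replace_instance` are hypothetical callbacks into your monitoring and provisioning systems; the consecutive-failure threshold is what keeps a single blip from triggering a (potentially cascading) restart.

```python
import time

def watchdog(check_health, replace_instance, failure_threshold=3,
             interval=5.0, max_cycles=None):
    """Poll a health check and trigger replacement only after sustained failures.

    Requiring several consecutive failures distinguishes transient blips
    from permanent faults. check_health and replace_instance are
    caller-supplied hooks (illustrative, not a real API).
    """
    consecutive_failures = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        if check_health():
            consecutive_failures = 0  # transient failure streak broken
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                replace_instance()  # treat as permanent; replace the host
                consecutive_failures = 0
        time.sleep(interval)

# Simulated health signal: one transient blip, then three hard failures.
signal = iter([True, False, True, False, False, False])
replacements = []
watchdog(lambda: next(signal),
         lambda: replacements.append("replaced"),
         failure_threshold=3, interval=0.0, max_cycles=6)
print(replacements)
```

Note that the single transient failure in the middle of the signal resets cleanly and never triggers a replacement; only the sustained failure streak does.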
Module 3: Continuous Deployment with Zero Downtime
- Implementing blue-green deployments with traffic shifting via DNS or load balancer rules to minimize rollback risk.
- Using canary rollouts with automated metric validation to detect regressions before full release.
- Coordinating database schema migrations that maintain backward compatibility during dual-version runtime.
- Designing deployment pipelines that pause on threshold breaches in error rate or latency during rollout.
- Managing configuration drift between environments by enforcing infrastructure-as-code compliance in staging and production.
- Handling long-running transactions during deployment windows to avoid data inconsistency on rollback.
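The canary-with-metric-validation flow above can be sketched as a simple rollout gate. `get_error_rate` and `set_traffic_percent` are placeholder hooks into a metrics store and a load balancer, and the step sizes and threshold are illustrative assumptions.

```python
def run_canary(get_error_rate, set_traffic_percent,
               steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Shift traffic to the canary in steps, gating each step on live metrics.

    Returns True on full rollout; on a threshold breach it shifts traffic
    back to 0% (automated rollback) and returns False.
    """
    for pct in steps:
        set_traffic_percent(pct)
        if get_error_rate() > max_error_rate:
            set_traffic_percent(0)  # automated rollback on regression
            return False
    return True

# Simulated rollout: error rate regresses once the canary takes 50% of traffic.
rates = iter([0.001, 0.002, 0.05])
traffic_history = []
succeeded = run_canary(lambda: next(rates), traffic_history.append)
print(succeeded, traffic_history)
```

A production gate would also evaluate latency and watch for threshold breaches over a soak window rather than a single reading, per the pipeline-pause bullet above.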
Module 4: Automated Monitoring and Alerting
- Defining SLO-based error budgets to determine when to trigger alerts versus tolerate transient degradation.
- Configuring synthetic transactions to detect end-to-end service failures before user impact.
- Filtering alert noise by correlating related incidents and suppressing lower-tier alerts during cascading failures.
- Setting up log sampling strategies for high-volume services to balance observability and cost.
- Integrating alert routing with on-call schedules and escalation policies in incident management systems.
- Validating monitoring coverage during deployments by comparing pre- and post-release metric availability.
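The SLO-based error budget idea above reduces to simple arithmetic. This sketch assumes a request-based SLO (availability measured as the fraction of successful requests); the specific numbers are illustrative.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    A 99.9% SLO over N requests allows 0.1% of them to fail; alerting on
    budget burn rather than raw error counts lets the team tolerate
    transient degradation without paging.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return max(0.0, 1.0 - failed_requests / allowed_failures)

# 99.9% SLO over 1,000,000 requests allows ~1,000 failures;
# 250 observed failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(round(remaining, 4))
```

An alerting rule would then page only when the remaining budget falls fast (a high burn rate), not on every individual error spike.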
Module 5: Disaster Recovery and Backup Strategies
- Establishing RPO and RTO targets for each service tier and aligning backup frequency accordingly.
- Testing cross-region failover procedures with controlled DNS TTL reductions and traffic rerouting.
- Encrypting backups at rest and managing key rotation in alignment with compliance requirements.
- Validating backup integrity through periodic restore drills in isolated environments.
- Documenting manual recovery steps for systems that lack full automation due to legacy constraints.
- Managing dependencies on third-party services during disaster scenarios where external APIs may also be degraded.
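The backup-integrity drill above can be sketched with content digests: record a checksum at backup time, then verify it against the restored copy in an isolated environment. The file contents and paths here are simulated stand-ins for a real database dump and restore.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large backups need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original_digest, restored_path):
    """A restore drill passes only if the restored copy matches the recorded digest."""
    return sha256_of(restored_path) == original_digest

# Simulated drill: "back up" a file, record its digest, "restore" a copy, verify.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "db.dump")
    with open(src, "wb") as f:
        f.write(b"backup payload")
    digest = sha256_of(src)          # recorded at backup time
    restored = os.path.join(tmp, "restored.dump")
    with open(restored, "wb") as f:  # stand-in for an actual restore
        f.write(b"backup payload")
    print(verify_restore(digest, restored))
```

A digest match proves the bytes survived, but a full drill should still start the restored database and run application-level checks, since byte-identical files can hide logical corruption captured at backup time.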
Module 6: Chaos Engineering and Proactive Resilience Testing
- Scheduling chaos experiments during low-traffic windows to minimize business impact.
- Injecting network latency and packet loss at the container or host level to simulate regional outages.
- Automating rollback of chaos experiments if predefined safety thresholds are violated.
- Integrating chaos testing into CI/CD pipelines for stateful services with ephemeral environments.
- Measuring blast radius by isolating test scopes to specific service instances or availability zones.
- Documenting failure modes discovered during experiments to update runbooks and architecture diagrams.
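The automated-rollback-on-safety-breach pattern above can be sketched as a guarded experiment loop. `inject_fault`, `revert_fault`, and `get_p99_latency_ms` are illustrative hooks (e.g. adding network latency at the host and reading p99 from a metrics store); the threshold is an example value.

```python
def run_chaos_experiment(inject_fault, revert_fault, get_p99_latency_ms,
                         safety_threshold_ms=500, checks=5):
    """Run a fault injection, aborting early if a safety threshold is violated.

    Returns True if the experiment completed, False if it was aborted.
    The finally block guarantees the injected fault is reverted either way.
    """
    inject_fault()
    try:
        for _ in range(checks):
            if get_p99_latency_ms() > safety_threshold_ms:
                return False  # blast radius exceeded; abort the experiment
        return True
    finally:
        revert_fault()  # rollback always runs, even on abort or exception

# Simulated run: latency degrades past the threshold on the third check.
readings = iter([120, 340, 800])
events = []
ok = run_chaos_experiment(lambda: events.append("inject"),
                          lambda: events.append("revert"),
                          lambda: next(readings),
                          safety_threshold_ms=500, checks=5)
print(ok, events)
```

Putting the revert in a `finally` block is the important design choice: the experiment must not be able to exit, crash, or abort while leaving the injected fault in place.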
Module 7: Governance and Operational Trade-offs
- Approving exceptions to fault tolerance standards for low-risk internal tools based on cost-benefit analysis.
- Requiring resilience design reviews for new services that integrate with critical business workflows.
- Enforcing tagging and metadata standards to track ownership and recovery priority during incidents.
- Allocating budget for redundancy features based on service criticality and historical incident data.
- Translating incident postmortem findings into updated fault tolerance controls to prevent recurrence.
- Managing technical debt in legacy systems by prioritizing incremental resilience improvements over full rewrites.
Module 8: Cross-Team Coordination and Incident Response
- Establishing communication protocols for incident commanders during multi-team outages.
- Standardizing runbook formats to ensure clarity and actionability during high-stress events.
- Conducting blameless postmortems with participation from development, operations, and product teams.
- Integrating third-party vendor response SLAs into incident escalation paths for external dependencies.
- Rotating on-call responsibilities across team members to distribute cognitive load and build shared expertise.
- Simulating large-scale outages with tabletop exercises to validate coordination and decision-making under pressure.