This curriculum spans the design, operation, and governance of highly available systems. It is equivalent in scope to a multi-workshop program embedded within an enterprise reliability engineering initiative, covering technical implementation, cross-team coordination, and compliance alignment across the full incident lifecycle.
Module 1: Defining and Measuring System Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on service criticality and business SLAs, as sketched after this list
- Implementing synthetic transaction monitoring to simulate user workflows and detect degradation before real users are impacted
- Configuring scheduled maintenance windows that do not violate contractual uptime obligations
- Calibrating monitoring thresholds to balance sensitivity with operational noise and alert fatigue
- Integrating business transaction data into availability calculations for customer-impacting outages
- Designing data collection pipelines that aggregate logs, metrics, and traces for consistent availability reporting across hybrid environments
- Handling clock skew and time synchronization across distributed systems when calculating outage durations
- Establishing baselines for normal behavior to detect anomalies in availability patterns
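A minimal sketch of the availability math from the first bullet, assuming outages are recorded as (start, end) timestamp pairs; the incident data and reporting window below are illustrative:

```python
from datetime import datetime, timedelta

# Illustrative outage records for one reporting window: (start, end) pairs.
incidents = [
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 45)),
    (datetime(2024, 3, 19, 9, 30), datetime(2024, 3, 19, 9, 42)),
]

window_start = datetime(2024, 3, 1)
window_end = datetime(2024, 4, 1)
window = window_end - window_start

downtime = sum((end - start for start, end in incidents), timedelta())
uptime_pct = 100 * (1 - downtime / window)

# MTBF averages the time between failures; MTTR averages time to restore.
mtbf = (window - downtime) / len(incidents)
mttr = downtime / len(incidents)

print(f"availability {uptime_pct:.3f}%, MTBF {mtbf}, MTTR {mttr}")
```

The same record format feeds clock-skew handling: normalizing all timestamps to a single reference clock before subtraction keeps outage durations consistent across distributed systems.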
Module 2: Architecting for Resilience and Fault Tolerance
- Choosing between active-active and active-passive deployment topologies based on cost, data consistency, and recovery time requirements
- Implementing circuit breakers and bulkheads in microservices to prevent cascading failures during partial outages
- Designing retry logic with exponential backoff and jitter to avoid thundering herd problems during transient failures (sketched after this list)
- Selecting consensus algorithms (e.g., Raft, Paxos) for distributed coordination systems based on quorum requirements and failure modes
- Configuring data replication strategies (synchronous vs. asynchronous) across regions to balance consistency and availability
- Validating failover automation through regular, unannounced drills that confirm readiness while containing production risk
- Implementing health checks that reflect actual service capability, not just process liveness
- Designing stateless services where possible to simplify recovery and scaling during disruptions
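A sketch of the retry pattern above, assuming a hypothetical TransientError marks retryable failures; the "full jitter" variant shown is one common formulation, not the only one:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for whatever exception marks a retryable failure."""

def retry_with_backoff(op, max_attempts=5, base_delay=0.1, cap=10.0):
    """Retry a flaky callable with capped exponential backoff and full
    jitter, spreading simultaneous retriers out in time so they do not
    stampede a recovering dependency."""
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; let the caller handle it
            # Full jitter: sleep a uniform random time up to the capped
            # exponential bound for this attempt.
            bound = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))
```

The jitter is what prevents the thundering herd: without it, every client that failed at the same moment retries at the same moment.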
Module 3: Incident Detection and Alerting Strategy
- Classifying alerts by severity and impact to route them to appropriate on-call personnel and avoid escalation fatigue
- Integrating observability tools with incident management platforms to auto-create and enrich incident tickets
- Defining service-specific SLOs and error budgets to trigger alerts based on reliability erosion, not just thresholds (see the sketch after this list)
- Filtering false positives by correlating alerts across layers (infrastructure, application, business logic)
- Implementing dynamic thresholds using historical data to adapt to usage patterns and reduce noise
- Ensuring alerting coverage for third-party dependencies with limited observability access
- Documenting alert ownership and runbooks at creation to prevent ambiguity during incidents
- Testing alert delivery paths (SMS, email, push) regularly to verify reliability
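A sketch of error-budget-based alerting under simple assumptions: events in a rolling window are classified as good or bad, and the 25% paging threshold is illustrative:

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the window's error budget still unspent.
    slo_target=0.999 means at most 0.1% of events may be bad."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad

# Page on budget erosion, not a raw threshold: 25% remaining is an
# illustrative trigger for a 30-day window.
remaining = error_budget_remaining(0.999, good_events=999_150, total_events=1_000_000)
if remaining < 0.25:
    print(f"page on-call: {remaining:.0%} of error budget remains")
```

Framing alerts this way ties paging to SLO risk: a slow leak of errors pages before it breaches the SLO, while brief blips that spend little budget stay quiet.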
Module 4: Incident Response and Coordination
- Assigning and rotating incident commander roles to maintain clear leadership during complex outages
- Using communication templates to standardize status updates for internal teams and external stakeholders
- Isolating compromised systems during incidents without exacerbating availability issues
- Deciding when to roll back deployments versus applying hotfixes based on root cause analysis speed and risk
- Coordinating cross-team responses when outages span multiple service boundaries and ownership domains
- Maintaining a real-time incident timeline to support post-mortem analysis and regulatory requirements (sketched after this list)
- Enforcing communication protocols to prevent information silos during high-pressure events
- Managing external communications during public-facing outages while preserving technical investigation integrity
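For the timeline bullet above, a minimal sketch of an append-only record with UTC timestamps; the field names and severity labels are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class TimelineEntry:
    """One append-only timeline record; UTC timestamps keep ordering
    unambiguous for post-mortems and regulators."""
    author: str
    note: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

timeline: list[TimelineEntry] = []
timeline.append(TimelineEntry("ic-oncall", "declared SEV-2, paging database team"))
timeline.append(TimelineEntry("db-oncall", "replica lag confirmed, starting failover"))
```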
Module 5: Root Cause Analysis and Post-Incident Review
- Conducting blameless post-mortems that focus on systemic factors, not individual errors
- Using structured analysis methods (e.g., 5 Whys, Fishbone) to uncover latent conditions contributing to outages
- Documenting decision points during incidents to evaluate response effectiveness and identify gaps
- Classifying incident types (e.g., deployment, configuration, capacity, dependency) to prioritize remediation efforts
- Tracking action items from post-mortems in a centralized system with ownership and deadlines (see the sketch after this list)
- Sharing post-mortem findings across engineering teams to prevent recurrence of similar failure modes
- Integrating RCA findings into change advisory board (CAB) reviews for high-risk modifications
- Validating fixes through targeted testing before closing incident follow-up items
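A minimal sketch of action-item tracking with ownership and deadlines; the fields and helper are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    incident_id: str
    description: str
    owner: str
    due: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Surface open post-mortem follow-ups that have slipped past
    their deadline so owners can be nudged before the next review."""
    return [item for item in items if not item.done and item.due < today]
```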
Module 6: Change and Configuration Management
- Implementing canary deployments with progressive traffic shifts to detect issues before full rollout (sketched after this list)
- Enforcing mandatory peer review and automated checks for infrastructure-as-code changes
- Managing configuration drift in long-running systems through automated reconciliation
- Using feature flags to decouple deployment from release, enabling runtime control of functionality
- Assessing change risk based on service criticality, change scope, and historical failure patterns
- Requiring rollback plans and pre-tested recovery procedures for all production changes
- Logging and auditing all configuration changes with user attribution and timestamp accuracy
- Restricting direct production access and enforcing changes through CI/CD pipelines
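A sketch of the progressive traffic shift described above. The stage schedule and the three callables (set_canary_weight, is_healthy, rollback) are assumed integration points with the load balancer and metrics stack, not a real API:

```python
import time

# Illustrative shift schedule: (canary traffic %, soak seconds).
STAGES = [(1, 300), (5, 600), (25, 900), (50, 900), (100, 0)]

def run_canary(set_canary_weight, is_healthy, rollback):
    """Walk traffic onto the canary stage by stage, aborting on the
    first unhealthy soak. set_canary_weight(pct) reprograms the load
    balancer, is_healthy() consults SLO metrics, and rollback()
    restores the stable version."""
    for pct, soak_seconds in STAGES:
        set_canary_weight(pct)
        time.sleep(soak_seconds)
        if not is_healthy():
            rollback()
            return False  # promotion aborted
    return True  # canary now serves 100% of traffic
```

The small early stages bound blast radius: a defect surfaces while only 1% to 5% of users see the new version, and the pre-tested rollback path undoes it.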
Module 7: Dependency and Supply Chain Risk
- Mapping direct and transitive dependencies to assess blast radius during third-party outages
- Implementing fallback mechanisms or cached responses for non-critical external APIs (see the sketch after this list)
- Monitoring upstream provider SLAs and performance trends to anticipate degradation
- Requiring contractual commitments for incident communication and resolution timelines from vendors
- Conducting due diligence on open-source libraries for maintenance activity and security posture
- Isolating high-risk dependencies in sandboxed environments or separate execution contexts
- Establishing internal mirrors or caches for critical software artifacts to reduce external reliance
- Testing failover procedures for cloud provider regions during multi-region dependency failures
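A sketch of the cached-fallback pattern for a non-critical external API; fetch is an assumed wrapper around the real client, and the staleness bound is illustrative:

```python
import time

_cache = {}  # key -> (value, fetched_at in monotonic seconds)

def fetch_with_fallback(key, fetch, stale_ok_seconds=3600):
    """Call the external API via `fetch`, but serve a recent cached
    value when the call fails; acceptable here because the dependency
    is non-critical and stale data beats an error page."""
    try:
        value = fetch(key)
        _cache[key] = (value, time.monotonic())
        return value
    except Exception:
        cached = _cache.get(key)
        if cached and time.monotonic() - cached[1] < stale_ok_seconds:
            return cached[0]  # degrade gracefully with stale data
        raise  # no usable fallback; surface the failure
```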
Module 8: Capacity Planning and Scalability
- Forecasting resource demand based on historical growth, seasonality, and product roadmap
- Designing auto-scaling policies that respond to meaningful metrics (e.g., request latency, queue depth), as sketched after this list
- Conducting load testing under realistic traffic patterns to validate scalability assumptions
- Identifying and eliminating single points of failure in scaling infrastructure (e.g., database connection limits)
- Right-sizing cloud instances based on actual utilization, not peak theoretical demand
- Planning for sudden traffic spikes due to marketing campaigns or viral events
- Monitoring queue backlogs and saturation indicators to detect impending capacity exhaustion
- Implementing graceful degradation strategies when capacity limits are reached
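A sketch of scaling on queue depth rather than CPU, assuming a known per-worker processing rate and a target drain time; all parameter names are illustrative:

```python
import math

def desired_workers(queue_depth, per_worker_rate, target_drain_seconds, max_workers):
    """Size the pool so the backlog drains within the target, treating
    queue depth as the saturation signal that tracks user impact."""
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    # Clamp to the pool's hard limits: at least one worker, at most capacity.
    return max(1, min(max_workers, needed))

# e.g. 12,000 queued jobs, 10 jobs/s per worker, drain within 120 s:
print(desired_workers(12_000, 10, 120, max_workers=50))  # -> 10
```

Hitting the max_workers clamp is the cue for the graceful-degradation strategies in the final bullet: shed or defer low-priority work instead of letting the backlog grow unbounded.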
Module 9: Governance, Compliance, and Audit
- Aligning availability controls with regulatory requirements (e.g., HIPAA, GDPR, PCI DSS) for data access during outages
- Documenting availability controls and incident response procedures for external audits
- Implementing role-based access controls for production systems to meet segregation of duties requirements
- Retaining incident logs and communications for legally mandated retention periods (see the sketch after this list)
- Conducting regular internal reviews of availability practices against industry standards (e.g., NIST, ISO 27001)
- Reporting availability metrics to executive leadership and board members with context on risk exposure
- Integrating availability risk into enterprise risk management frameworks
- Ensuring third-party providers undergo independent audits (e.g., SOC 2) relevant to service continuity
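A minimal sketch of a retention guard for incident records; the record types and durations below are placeholders, since actual periods come from counsel and the applicable regulation:

```python
from datetime import datetime, timedelta, timezone

# Placeholder mandates; substitute the legally required durations.
RETENTION = {
    "incident_log": timedelta(days=7 * 365),
    "status_update": timedelta(days=2 * 365),
}

def must_retain(record_type: str, created: datetime) -> bool:
    """True while a record is inside its mandated retention period;
    deletion and archival jobs must skip anything this flags."""
    return datetime.now(timezone.utc) - created < RETENTION[record_type]
```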