This curriculum spans the design, operation, and governance of highly available systems. It is equivalent in scope to a multi-workshop program embedded within an enterprise reliability engineering initiative, covering technical implementation, cross-team coordination, and compliance alignment across the full incident lifecycle.
Module 1: Defining and Measuring System Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on service criticality and business SLAs, as sketched after this list
- Implementing synthetic transaction monitoring to simulate user workflows and detect degradation before real users are impacted
- Configuring scheduled maintenance windows that do not violate contractual uptime obligations
- Calibrating monitoring thresholds to balance sensitivity with operational noise and alert fatigue
- Integrating business transaction data into availability calculations for customer-impacting outages
- Designing data collection pipelines that aggregate logs, metrics, and traces for consistent availability reporting across hybrid environments
- Handling clock skew and time synchronization across distributed systems when calculating outage durations
- Establishing baselines for normal behavior to detect anomalies in availability patterns
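A minimal sketch of the availability math from the first bullet, assuming outages are recorded as (start, end) timestamp pairs; the incident data and reporting window below are illustrative:

```python
from datetime import datetime, timedelta

# Illustrative outage records for one reporting window: (start, end) pairs.
incidents = [
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 45)),
    (datetime(2024, 3, 19, 9, 30), datetime(2024, 3, 19, 9, 42)),
]

window_start = datetime(2024, 3, 1)
window_end = datetime(2024, 4, 1)
window = window_end - window_start

downtime = sum((end - start for start, end in incidents), timedelta())
uptime_pct = 100 * (1 - downtime / window)

# MTBF averages the time between failures; MTTR averages time to restore.
mtbf = (window - downtime) / len(incidents)
mttr = downtime / len(incidents)

print(f"availability {uptime_pct:.3f}%, MTBF {mtbf}, MTTR {mttr}")
```

The same record format feeds clock-skew handling: normalizing all timestamps to a single reference clock before subtraction keeps outage durations consistent across distributed systems.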
Module 2: Architecting for Resilience and Fault Tolerance
- Choosing between active-active and active-passive deployment topologies based on cost, data consistency, and recovery time requirements
- Implementing circuit breakers and bulkheads in microservices to prevent cascading failures during partial outages
- Designing retry logic with exponential backoff and jitter to avoid thundering herd problems during transient failures (sketched after this list)
- Selecting consensus algorithms (e.g., Raft, Paxos) for distributed coordination systems based on quorum requirements and failure modes
- Configuring data replication strategies (synchronous vs. asynchronous) across regions to balance consistency and availability
- Validating failover automation through regular, unannounced drills that confirm readiness while containing production risk
- Implementing health checks that reflect actual service capability, not just process liveness
- Designing stateless services where possible to simplify recovery and scaling during disruptions
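A sketch of the retry pattern above, assuming a hypothetical TransientError marks retryable failures; the "full jitter" variant shown is one common formulation, not the only one:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for whatever exception marks a retryable failure."""

def retry_with_backoff(op, max_attempts=5, base_delay=0.1, cap=10.0):
    """Retry a flaky callable with capped exponential backoff and full
    jitter, spreading simultaneous retriers out in time so they do not
    stampede a recovering dependency."""
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; let the caller handle it
            # Full jitter: sleep a uniform random time up to the capped
            # exponential bound for this attempt.
            bound = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))
```

The jitter is what prevents the thundering herd: without it, every client that failed at the same moment retries at the same moment.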
Module 3: Incident Detection and Alerting Strategy
- Classifying alerts by severity and impact to route them to appropriate on-call personnel and avoid escalation fatigue
- Integrating observability tools with incident management platforms to auto-create and enrich incident tickets
- Defining service-specific SLOs and error budgets to trigger alerts based on reliability erosion, not just thresholds (see the sketch after this list)
- Filtering false positives by correlating alerts across layers (infrastructure, application, business logic)
- Implementing dynamic thresholds using historical data to adapt to usage patterns and reduce noise
- Ensuring alerting coverage for third-party dependencies with limited observability access
- Documenting alert ownership and runbooks at creation to prevent ambiguity during incidents
- Testing alert delivery paths (SMS, email, push) regularly to verify reliability
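A sketch of error-budget-based alerting under simple assumptions: events in a rolling window are classified as good or bad, and the 25% paging threshold is illustrative:

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the window's error budget still unspent.
    slo_target=0.999 means at most 0.1% of events may be bad."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad

# Page on budget erosion, not a raw threshold: 25% remaining is an
# illustrative trigger for a 30-day window.
remaining = error_budget_remaining(0.999, good_events=999_150, total_events=1_000_000)
if remaining < 0.25:
    print(f"page on-call: {remaining:.0%} of error budget remains")
```

Framing alerts this way ties paging to SLO risk: a slow leak of errors pages before it breaches the SLO, while brief blips that spend little budget stay quiet.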
Module 4: Incident Response and Coordination
- Assigning and rotating incident commander roles to maintain clear leadership during complex outages
- Using communication templates to standardize status updates for internal teams and external stakeholders
- Isolating compromised systems during incidents without exacerbating availability issues
- Deciding when to roll back deployments versus applying hotfixes based on root cause analysis speed and risk
- Coordinating cross-team responses when outages span multiple service boundaries and ownership domains
- Maintaining a real-time incident timeline to support post-mortem analysis and regulatory requirements (sketched after this list)
- Enforcing communication protocols to prevent information silos during high-pressure events
- Managing external communications during public-facing outages while preserving technical investigation integrity
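For the timeline bullet above, a minimal sketch of an append-only record with UTC timestamps; the field names and severity labels are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class TimelineEntry:
    """One append-only timeline record; UTC timestamps keep ordering
    unambiguous for post-mortems and regulators."""
    author: str
    note: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

timeline: list[TimelineEntry] = []
timeline.append(TimelineEntry("ic-oncall", "declared SEV-2, paging database team"))
timeline.append(TimelineEntry("db-oncall", "replica lag confirmed, starting failover"))
```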
Module 5: Root Cause Analysis and Post-Incident Review
- Conducting blameless post-mortems that focus on systemic factors, not individual errors
- Using structured analysis methods (e.g., 5 Whys, Fishbone) to uncover latent conditions contributing to outages
- Documenting decision points during incidents to evaluate response effectiveness and identify gaps
- Classifying incident types (e.g., deployment, configuration, capacity, dependency) to prioritize remediation efforts
- Tracking action items from post-mortems in a centralized system with ownership and deadlines (see the sketch after this list)
- Sharing post-mortem findings across engineering teams to prevent recurrence of similar failure modes
- Integrating RCA findings into change advisory board (CAB) reviews for high-risk modifications
- Validating fixes through targeted testing before closing incident follow-up items
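A minimal sketch of action-item tracking with ownership and deadlines; the fields and helper are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    incident_id: str
    description: str
    owner: str
    due: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Surface open post-mortem follow-ups that have slipped past
    their deadline so owners can be nudged before the next review."""
    return [item for item in items if not item.done and item.due < today]
```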
Module 6: Change and Configuration Management
- Implementing canary deployments with progressive traffic shifts to detect issues before full rollout (sketched after this list)
- Enforcing mandatory peer review and automated checks for infrastructure-as-code changes
- Managing configuration drift in long-running systems through automated reconciliation
- Using feature flags to decouple deployment from release, enabling runtime control of functionality
- Assessing change risk based on service criticality, change scope, and historical failure patterns
- Requiring rollback plans and pre-tested recovery procedures for all production changes
- Logging and auditing all configuration changes with user attribution and timestamp accuracy
- Restricting direct production access and enforcing changes through CI/CD pipelines
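A sketch of the progressive traffic shift described above. The stage schedule and the three callables (set_canary_weight, is_healthy, rollback) are assumed integration points with the load balancer and metrics stack, not a real API:

```python
import time

# Illustrative shift schedule: (canary traffic %, soak seconds).
STAGES = [(1, 300), (5, 600), (25, 900), (50, 900), (100, 0)]

def run_canary(set_canary_weight, is_healthy, rollback):
    """Walk traffic onto the canary stage by stage, aborting on the
    first unhealthy soak. set_canary_weight(pct) reprograms the load
    balancer, is_healthy() consults SLO metrics, and rollback()
    restores the stable version."""
    for pct, soak_seconds in STAGES:
        set_canary_weight(pct)
        time.sleep(soak_seconds)
        if not is_healthy():
            rollback()
            return False  # promotion aborted
    return True  # canary now serves 100% of traffic
```

The small early stages bound blast radius: a defect surfaces while only 1% to 5% of users see the new version, and the pre-tested rollback path undoes it.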
Module 7: Dependency and Supply Chain Risk
- Mapping direct and transitive dependencies to assess blast radius during third-party outages
- Implementing fallback mechanisms or cached responses for non-critical external APIs (see the sketch after this list)
- Monitoring upstream provider SLAs and performance trends to anticipate degradation
- Requiring contractual commitments for incident communication and resolution timelines from vendors
- Conducting due diligence on open-source libraries for maintenance activity and security posture
- Isolating high-risk dependencies in sandboxed environments or separate execution contexts
- Establishing internal mirrors or caches for critical software artifacts to reduce external reliance
- Testing failover procedures for cloud provider regions during multi-region dependency failures
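A sketch of the cached-fallback pattern for a non-critical external API; fetch is an assumed wrapper around the real client, and the staleness bound is illustrative:

```python
import time

_cache = {}  # key -> (value, fetched_at in monotonic seconds)

def fetch_with_fallback(key, fetch, stale_ok_seconds=3600):
    """Call the external API via `fetch`, but serve a recent cached
    value when the call fails; acceptable here because the dependency
    is non-critical and stale data beats an error page."""
    try:
        value = fetch(key)
        _cache[key] = (value, time.monotonic())
        return value
    except Exception:
        cached = _cache.get(key)
        if cached and time.monotonic() - cached[1] < stale_ok_seconds:
            return cached[0]  # degrade gracefully with stale data
        raise  # no usable fallback; surface the failure
```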
Module 8: Capacity Planning and Scalability
- Forecasting resource demand based on historical growth, seasonality, and product roadmap
- Designing auto-scaling policies that respond to meaningful metrics (e.g., request latency, queue depth), as sketched after this list
- Conducting load testing under realistic traffic patterns to validate scalability assumptions
- Identifying and eliminating single points of failure in scaling infrastructure (e.g., database connection limits)
- Right-sizing cloud instances based on actual utilization, not peak theoretical demand
- Planning for sudden traffic spikes due to marketing campaigns or viral events
- Monitoring queue backlogs and saturation indicators to detect impending capacity exhaustion
- Implementing graceful degradation strategies when capacity limits are reached
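A sketch of scaling on queue depth rather than CPU, assuming a known per-worker processing rate and a target drain time; all parameter names are illustrative:

```python
import math

def desired_workers(queue_depth, per_worker_rate, target_drain_seconds, max_workers):
    """Size the pool so the backlog drains within the target, treating
    queue depth as the saturation signal that tracks user impact."""
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    # Clamp to the pool's hard limits: at least one worker, at most capacity.
    return max(1, min(max_workers, needed))

# e.g. 12,000 queued jobs, 10 jobs/s per worker, drain within 120 s:
print(desired_workers(12_000, 10, 120, max_workers=50))  # -> 10
```

Hitting the max_workers clamp is the cue for the graceful-degradation strategies in the final bullet: shed or defer low-priority work instead of letting the backlog grow unbounded.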
Module 9: Governance, Compliance, and Audit
- Aligning availability controls with regulatory requirements (e.g., HIPAA, GDPR, PCI DSS) for data access during outages
- Documenting availability controls and incident response procedures for external audits
- Implementing role-based access controls for production systems to meet segregation of duties requirements
- Retaining incident logs and communications for legally mandated retention periods (see the sketch after this list)
- Conducting regular internal reviews of availability practices against industry standards (e.g., NIST, ISO 27001)
- Reporting availability metrics to executive leadership and board members with context on risk exposure
- Integrating availability risk into enterprise risk management frameworks
- Ensuring third-party providers undergo independent audits (e.g., SOC 2) relevant to service continuity
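A minimal sketch of a retention guard for incident records; the record types and durations below are placeholders, since actual periods come from counsel and the applicable regulation:

```python
from datetime import datetime, timedelta, timezone

# Placeholder mandates; substitute the legally required durations.
RETENTION = {
    "incident_log": timedelta(days=7 * 365),
    "status_update": timedelta(days=2 * 365),
}

def must_retain(record_type: str, created: datetime) -> bool:
    """True while a record is inside its mandated retention period;
    deletion and archival jobs must skip anything this flags."""
    return datetime.now(timezone.utc) - created < RETENTION[record_type]
```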