This curriculum covers the design, operation, and governance of highly available systems with the technical specificity and cross-functional coordination found in multi-workshop reliability engineering programs at large-scale technology organizations.
Module 1: Defining and Measuring System Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on business criticality and service tier agreements.
- Implementing synthetic transaction monitoring to simulate user workflows and detect availability degradation before real users are impacted.
- Designing custom SLIs (Service Level Indicators) that reflect actual user-perceived availability, not just infrastructure health.
- Integrating business telemetry (e.g., transaction volume, API success rate) into availability calculations to avoid misleading uptime statistics.
- Establishing thresholds for degraded vs. failed states in multi-tier applications where partial functionality may still be operational.
- Calibrating monitoring intervals to balance detection speed with false positive rates in high-frequency systems.
- Documenting and socializing the mathematical definitions of availability used in reporting to prevent misinterpretation across teams.
- Handling time zone and calendar considerations when calculating rolling availability windows for global services.
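To make the request-based availability definition in this module concrete, here is a minimal sketch in Python. The window schema, the 99.9% SLO, and the treat-no-traffic-as-available policy are illustrative assumptions, not values prescribed by the curriculum.

```python
from dataclasses import dataclass

@dataclass
class Window:
    """Request counts for one monitoring interval (hypothetical schema)."""
    total: int
    failed: int

def rolling_availability(windows: list[Window]) -> float:
    """Request-based availability over a rolling set of windows:
    successful requests / total requests. This tracks user-perceived
    availability better than raw infrastructure uptime percentage."""
    total = sum(w.total for w in windows)
    failed = sum(w.failed for w in windows)
    if total == 0:
        return 1.0  # no traffic: treated as available (a policy choice)
    return (total - failed) / total

def error_budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent for this window."""
    budget = 1.0 - slo
    spent = 1.0 - availability
    return max(0.0, (budget - spent) / budget)
```

Publishing the exact formula alongside the dashboards is one way to satisfy the "documenting the mathematical definitions" bullet above: teams can check the code rather than argue about what a nine means.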
Module 2: High Availability Architecture Design
- Choosing between active-passive and active-active deployment models based on data consistency requirements and failover tolerance.
- Implementing regional redundancy with DNS or global load balancers while managing latency and data sovereignty constraints.
- Designing stateless services to enable seamless failover, or implementing distributed session stores where state must persist.
- Selecting replication strategies (synchronous vs. asynchronous) for databases based on RPO and RTO requirements.
- Architecting cross-AZ (Availability Zone) redundancy with automated failover triggers and health checks.
- Deciding when to use third-party CDN failover mechanisms versus primary origin redundancy.
- Validating failover automation through controlled chaos engineering experiments without impacting production data.
- Managing configuration drift in redundant environments through infrastructure-as-code enforcement.
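The active-passive failover decision described above can be sketched as a small controller. The endpoint names, the three-consecutive-failures trip threshold, and the deliberately manual fail-back policy are all assumptions for illustration.

```python
class FailoverController:
    """Active-passive failover driven by consecutive health-check failures.

    A sketch only: real systems layer this onto DNS or a global load
    balancer and add quorum checks to avoid split-brain decisions.
    """

    def __init__(self, primary: str, standby: str, trip_after: int = 3):
        self.primary = primary
        self.standby = standby
        self.trip_after = trip_after
        self.active = primary
        self._consecutive_failures = 0

    def record_health_check(self, healthy: bool) -> str:
        """Feed in one health-check result for the primary; returns the
        endpoint traffic should currently be routed to."""
        if healthy:
            self._consecutive_failures = 0
            # Fail-back is intentionally manual here, to avoid flapping
            # between endpoints on an intermittently healthy primary.
        else:
            self._consecutive_failures += 1
            if (self.active == self.primary
                    and self._consecutive_failures >= self.trip_after):
                self.active = self.standby  # trip failover
        return self.active
```

Requiring several consecutive failures before tripping trades a little detection latency for far fewer spurious failovers, which matters when failover itself is expensive (cache warm-up, replication catch-up).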
Module 3: Fault Tolerance and Redundancy Patterns
- Implementing circuit breakers in microservices to prevent cascading failures during downstream outages.
- Designing retry logic with exponential backoff and jitter to avoid thundering herd problems during transient failures.
- Introducing redundancy at the component level (e.g., dual power supplies, multi-homed network interfaces) in on-prem deployments.
- Using queue-based load leveling to decouple components and absorb traffic spikes during partial outages.
- Deploying canary services to test fault tolerance mechanisms in production-like conditions.
- Configuring health checks that accurately reflect service readiness, avoiding falsely healthy results caused by cached responses.
- Managing failover state in distributed locking systems to prevent split-brain scenarios.
- Implementing graceful degradation paths that disable non-critical features during resource constraints.
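Retry with exponential backoff and jitter, as named above, might look like the following sketch using the "full jitter" variant; the base delay, cap, and `TransientError` type are hypothetical.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. a timeout or 503)."""

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0):
    """Exponential backoff with full jitter: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)]. The randomness de-synchronizes
    retrying clients, avoiding the thundering-herd problem."""
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, attempts: int = 5, base: float = 0.1,
                      cap: float = 10.0):
    """Invoke fn, retrying on TransientError with jittered backoff;
    the final failure is re-raised once the attempt budget is spent."""
    for i, delay in enumerate(backoff_delays(attempts, base, cap)):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```

In production this sits behind a circuit breaker: once consecutive failures cross a threshold, the breaker opens and callers fail fast instead of burning their retry budgets against a downed dependency.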
Module 4: Disaster Recovery Planning and Execution
- Classifying systems by recovery priority based on business impact analysis (BIA) and RTO/RPO requirements.
- Designing and testing cold, warm, and hot standby environments with documented runbooks for activation.
- Validating backup integrity through periodic restore drills, including full environment recovery.
- Coordinating geographically distributed recovery sites while complying with data residency regulations.
- Automating DNS and traffic routing changes during failover using API-driven control planes.
- Managing stateful data replication across regions with conflict resolution strategies for bidirectional sync.
- Establishing communication protocols for incident command during large-scale outages involving multiple teams.
- Documenting and versioning disaster recovery playbooks with role-specific responsibilities and escalation paths.
Module 5: Monitoring and Alerting for Availability
- Configuring multi-dimensional alerting that correlates infrastructure, application, and business metrics to reduce noise.
- Setting dynamic thresholds for anomaly detection in systems with variable load patterns.
- Implementing alert muting and routing policies based on on-call schedules and incident severity.
- Using golden signals (latency, traffic, errors, saturation) as the foundation for availability dashboards.
- Integrating synthetic and real-user monitoring (RUM) to detect geographic or client-specific outages.
- Designing escalation paths that trigger secondary notifications if initial responders do not acknowledge within SLA.
- Managing alert fatigue by suppressing low-priority alerts during ongoing incidents.
- Validating end-to-end monitoring coverage through red team exercises that simulate specific failure modes.
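Dynamic thresholding with a rolling mean plus k standard deviations is one simple way to implement the variable-load alerting described above; the sample history and k = 3 below are assumptions, and real systems usually prefer robust statistics (medians, percentiles) over the plain mean shown here.

```python
import statistics

def dynamic_threshold(history: list[float], k: float = 3.0) -> float:
    """Alert threshold set at mean + k standard deviations of recent
    samples, so the bar moves with normal load instead of sitting at a
    fixed number that is wrong at both peak and trough."""
    return statistics.fmean(history) + k * statistics.pstdev(history)

def is_anomalous(value: float, history: list[float], k: float = 3.0) -> bool:
    """True when the latest sample exceeds the dynamic threshold."""
    return value > dynamic_threshold(history, k)
```

Raising k widens the band: fewer false positives, slower detection — the same trade-off the "calibrating monitoring intervals" bullet in Module 1 describes for sampling frequency.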
Module 6: Change Management and Deployment Safety
- Enforcing deployment freezes during critical business periods with automated policy checks in CI/CD pipelines.
- Implementing blue-green or canary deployments to reduce blast radius of faulty releases.
- Requiring pre-deployment health check validations and rollback readiness assessments.
- Using feature flags to decouple deployment from release, enabling immediate disablement during instability.
- Tracking change velocity and correlating deployments with incident spikes to adjust release policies.
- Requiring peer review and approval gates for changes to high-availability components.
- Automating rollback triggers based on real-time error rate or latency thresholds post-deployment.
- Logging and auditing all production changes for post-incident root cause analysis.
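An automated rollback trigger keyed to post-deployment error rates, as in the bullet above, might reduce to a guard like this; the ratio, absolute floor, and minimum-sample parameters are illustrative, not recommended values.

```python
def should_roll_back(baseline_error_rate: float,
                     canary_error_rate: float,
                     min_requests: int,
                     canary_requests: int,
                     max_ratio: float = 2.0,
                     absolute_floor: float = 0.001) -> bool:
    """Roll back when the new version's error rate is materially worse
    than the baseline.

    Two guards prevent false triggers: require enough canary traffic to
    have statistical signal, and ignore error rates under an absolute
    floor so a 0.0001% -> 0.0003% move does not trip the ratio test."""
    if canary_requests < min_requests:
        return False  # not enough signal yet
    if canary_error_rate <= absolute_floor:
        return False  # within an acceptable absolute error rate
    return canary_error_rate > max_ratio * baseline_error_rate
```

Latency regressions can be handled the same way by comparing a high percentile (e.g. p99) against the baseline instead of an error rate.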
Module 7: Incident Response and Outage Management
- Activating incident response protocols with defined roles (incident commander, comms lead, resolver) during outages.
- Using status pages to communicate outage details externally while protecting sensitive operational information.
- Preserving logs, metrics, and configuration states during active incidents for forensic analysis.
- Coordinating cross-team troubleshooting in shared systems with clear ownership boundaries.
- Implementing time-boxed troubleshooting phases to avoid analysis paralysis during critical outages.
- Managing stakeholder communication with regular updates at defined intervals, even if resolution is pending.
- Using war rooms or virtual incident bridges with screen sharing and collaborative documentation.
- Enforcing no-blame post-mortems focused on systemic improvements rather than individual accountability.

Module 8: Availability Governance and Compliance
- Defining availability requirements in service contracts and aligning them with technical capabilities.
- Conducting third-party audits of cloud provider SLAs and their actual historical performance.
- Mapping availability controls to regulatory frameworks (e.g., SOC 2, ISO 27001, HIPAA) where uptime is a compliance factor.
- Establishing board-level reporting on availability KPIs and major incident trends.
- Reviewing and updating availability policies annually or after significant architectural changes.
- Managing vendor risk by assessing the availability posture of critical third-party dependencies.
- Documenting exceptions to availability standards with risk acceptance forms signed by business stakeholders.
- Enforcing configuration compliance through automated drift detection and remediation.
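Automated drift detection, at its simplest, is a diff between the declared source of truth and the live configuration; the key names below are hypothetical, and real tooling must also handle nested structures and deliberately excluded keys.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Keys where live config diverges from the declared source of truth.

    Returns {key: (desired_value, actual_value)}; a key present on only
    one side appears with None on the other, so unmanaged additions
    (e.g. a debug flag left on) surface as drift too."""
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in sorted(keys)
        if desired.get(k) != actual.get(k)
    }
```

Remediation policy then decides per key whether drift is auto-reverted or merely reported, which is also where documented exceptions with risk acceptance plug in.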
Module 9: Cost-Availability Trade-offs and Optimization
- Evaluating the cost-benefit of additional redundancy layers against the business cost of downtime.
- Negotiating premium support and SLA service credits with cloud providers for mission-critical workloads.
- Right-sizing high-availability components to avoid overprovisioning while maintaining resilience.
- Using reserved instances or savings plans for predictable active-active environments.
- Implementing auto-pausing or standby modes for non-critical systems during off-peak hours.
- Quantifying the financial impact of partial outages versus complete failures to guide investment decisions.
- Comparing managed vs. self-hosted services based on availability requirements and operational overhead.
- Optimizing backup retention policies to balance recovery needs with storage costs.
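The downtime-cost arithmetic behind these trade-offs can be made explicit; the revenue-per-hour and redundancy-cost figures used below are placeholders, and a fuller model would also weight partial outages, as the module notes.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days

def expected_annual_downtime_cost(availability: float,
                                  revenue_per_hour: float) -> float:
    """Expected yearly downtime cost at a steady-state availability:
    e.g. 99.9% implies ~8.77 hours of downtime per year."""
    return (1.0 - availability) * HOURS_PER_YEAR * revenue_per_hour

def redundancy_worth_it(current_availability: float,
                        improved_availability: float,
                        revenue_per_hour: float,
                        annual_redundancy_cost: float) -> bool:
    """True when the avoided downtime cost exceeds what the extra
    redundancy layer costs per year."""
    savings = (
        expected_annual_downtime_cost(current_availability, revenue_per_hour)
        - expected_annual_downtime_cost(improved_availability, revenue_per_hour)
    )
    return savings > annual_redundancy_cost
```

The model makes the diminishing returns visible: each additional nine cuts the remaining downtime cost by 90%, so the same redundancy spend buys less at 99.99% than it did at 99%.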