This curriculum outlines a multi-workshop availability engineering program, covering planning, execution, and governance at the depth expected of enterprise incident command systems and large-scale cloud infrastructure reviews.
Module 1: Defining Availability Requirements and SLIs
- Selecting service-level indicators (SLIs) that reflect actual user-perceived availability, such as preferring end-to-end request success rate over synthetic transaction latency alone.
- Negotiating SLI measurement windows (e.g., 5-minute vs. 1-hour rollups) with product teams to balance sensitivity and operational noise.
- Implementing blackbox probing for externally accessible endpoints while accounting for CDN and edge caching effects on availability signals.
- Deciding whether to include partial degradation (e.g., degraded search functionality) as downtime in SLI calculations.
- Calibrating error budget burn rate thresholds that trigger incident response without causing alert fatigue.
- Documenting SLI calculation logic in code (e.g., Prometheus queries) to ensure auditability and consistency across teams.
- Handling third-party dependency outages by defining whether they count against internal availability commitments.
- Mapping business-critical user journeys to technical endpoints to prioritize monitoring coverage.
Module 2: High-Availability Architecture Design
- Choosing between active-passive and active-active failover models based on RTO and RPO requirements for stateful services.
- Designing regional failover strategies that account for DNS TTL limitations and cloud provider load balancer propagation delays.
- Implementing distributed consensus algorithms (e.g., Raft) for metadata coordination in multi-region control planes.
- Selecting quorum configurations in clustered databases to balance consistency, availability, and fault tolerance.
- Validating cross-AZ routing and failover behavior in virtual private clouds through controlled network partition testing.
- Architecting state replication mechanisms (e.g., async vs. sync replication) for session stores under bandwidth and latency constraints.
- Integrating circuit breakers at service mesh level to prevent cascading failures during partial outages.
- Evaluating cost-performance trade-offs of multi-cloud vs. multi-region deployment for critical workloads.
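The quorum trade-off above can be made concrete with a short sketch of majority-quorum sizing for a Raft-style cluster of n voting members; the arithmetic is standard, but the cluster sizes shown are illustrative.

```python
# Majority-quorum sizing for a Raft-style cluster of n voting members.

def majority_quorum(n: int) -> int:
    """Smallest group that constitutes a majority of n voters."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Members that can fail while a majority can still be formed."""
    return n - majority_quorum(n)

for n in (3, 4, 5, 7):
    print(f"n={n}: quorum={majority_quorum(n)}, "
          f"tolerates {fault_tolerance(n)} failures")
```

Note that a 4-member cluster tolerates the same single failure as a 3-member cluster while paying for an extra replica, which is why odd cluster sizes are the usual choice.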
Module 3: Monitoring and Observability Implementation
- Deploying redundant telemetry collectors to prevent monitoring blind spots during infrastructure failures.
- Configuring multi-dimensional alerting rules that correlate metrics, logs, and traces to reduce false positives.
- Setting up heartbeat monitoring for background job processors that do not serve HTTP traffic.
- Instrumenting retry logic in clients to distinguish between transient and permanent failures in error rate calculations.
- Storing and querying high-cardinality labels in time-series databases without degrading query performance.
- Implementing log sampling strategies for high-volume services while preserving debuggability for rare errors.
- Validating alert delivery paths through multiple channels (e.g., PagerDuty, SMS, backup email) during comms outages.
- Using synthetic transactions to simulate user flows that are difficult to monitor via production traffic alone.
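The log-sampling bullet above can be sketched as a keep/drop decision that never discards rare, high-signal records; the field names, level strings, and 1% sampling rate are illustrative assumptions, not a particular logging library's API.

```python
import random

# Log sampling sketch: always keep error-level records, sample routine
# records at a fixed rate. Field names and the 1% rate are illustrative.

def should_keep(record: dict, info_rate: float = 0.01) -> bool:
    if record.get("level") in ("ERROR", "FATAL"):
        return True                     # never drop rare, high-signal errors
    return random.random() < info_rate  # sample the high-volume remainder

print(should_keep({"level": "ERROR", "msg": "payment failed"}))  # True
```

Real deployments often refine this with trace-aware sampling (keep all records for a sampled trace) so that the retained INFO lines are still correlated with each other.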
Module 4: Incident Response and Failover Execution
- Executing DNS-based failover with pre-warmed endpoints to minimize recovery time during regional outages.
- Validating failover runbooks under degraded conditions, such as partial control plane access.
- Coordinating incident command structure handoffs during extended outages exceeding 12 hours.
- Blocking automated deployments during active incidents to prevent compounding failures.
- Rotating credentials and certificates post-incident to close potential attack vectors exposed during failover.
- Managing communication with external stakeholders using templated status updates without disclosing sensitive architecture details.
- Executing data reconciliation jobs after failback to resolve inconsistencies from async replication.
- Enforcing change freeze windows following major incidents to stabilize the environment.
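The templated stakeholder update above can be sketched as follows; the template text, field names, and update cadence are illustrative assumptions (real programs typically drive these from a status-page tool), but the pattern of filling a fixed template is the point: nothing outside the whitelisted fields can leak into the external message.

```python
from datetime import datetime, timezone
from string import Template

# External status update from a fixed template, a sketch. Only the fields
# below can appear in the output, keeping architecture details out of
# customer-facing communications.

STATUS_TEMPLATE = Template(
    "[$timestamp] $severity incident affecting $service: $impact. "
    "Next update in $next_update_minutes minutes."
)

def render_update(service: str, severity: str, impact: str,
                  next_update_minutes: int = 30) -> str:
    return STATUS_TEMPLATE.substitute(
        timestamp=datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%MZ"),
        service=service,
        severity=severity,
        impact=impact,
        next_update_minutes=next_update_minutes,
    )

print(render_update("Checkout", "Major", "elevated error rates"))
```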
Module 5: Disaster Recovery Planning and Testing
- Scheduling quarterly DR drills that simulate complete data center loss, including backup power and network egress failure.
- Validating backup integrity by restoring production databases to isolated environments and verifying checksums.
- Measuring actual RTO by timing full application stack recovery from backups, including dependency chains.
- Managing encryption key escrow and access controls for offline backups in air-gapped storage.
- Documenting manual intervention steps required when automated recovery tools are unavailable.
- Testing cross-region IAM role replication and policy synchronization during DR activation.
- Updating DR plans after major architectural changes, such as migration to serverless components.
- Coordinating DR tests with dependent teams to avoid cascading impact on shared services.
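The backup-integrity bullet above can be sketched as a checksum comparison between a restored artifact and the digest recorded at backup time; SHA-256 and the streaming chunk size are reasonable defaults here, and the paths are illustrative.

```python
import hashlib
from pathlib import Path

# Backup integrity check sketch: hash a restored file in streaming chunks
# and compare against the checksum recorded when the backup was taken.

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(restored: Path, expected_sha256: str) -> bool:
    return sha256_of(restored) == expected_sha256
```

Checksum verification only proves the bytes survived; a full DR drill still has to start the database from those bytes and run application-level consistency checks.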
Module 6: Change and Configuration Management
- Enforcing canary deployment patterns with automated rollback triggers based on availability metrics.
- Version-controlling infrastructure-as-code (IaC) templates and validating drift detection mechanisms.
- Implementing approval gates for changes during high-risk periods, such as peak traffic seasons.
- Using feature flags with kill switches to disable components without redeploying binaries.
- Scanning configuration files for hardcoded secrets before merging into production pipelines.
- Requiring peer review for changes to load balancer health check configurations due to their impact on traffic routing.
- Archiving deprecated configuration variants to prevent accidental reuse in future deployments.
- Validating configuration templates against schema rules before applying to multi-region environments.
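The pre-merge secret scan above can be sketched with a couple of patterns; the regexes are illustrative (an AWS access key ID prefix and a generic quoted credential assignment), and production pipelines normally rely on a dedicated scanner such as gitleaks rather than hand-rolled rules.

```python
import re

# Pre-merge secret scan sketch. Patterns are illustrative, not exhaustive.

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    re.compile(r"(?i)(password|secret|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
]

def find_secrets(text: str) -> list[int]:
    """Return 1-based line numbers that match any secret pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if any(pat.search(line) for pat in SECRET_PATTERNS):
            hits.append(lineno)
    return hits

config = 'db_password = "hunter2hunter2"\nregion = us-east-1'
print(find_secrets(config))  # [1]
```

A gate like this belongs in CI before merge; scanning after deployment only tells you which credentials now need rotating.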
Module 7: Capacity Planning and Scalability
- Projecting resource needs based on historical growth trends and upcoming product launches.
- Setting up predictive autoscaling using ML-driven forecasting models for seasonal traffic patterns.
- Reserving capacity in secondary regions to handle failover workloads without performance degradation.
- Monitoring queue depth and backlog growth in message brokers to anticipate scaling bottlenecks.
- Conducting load tests with production-like data distributions to validate scaling assumptions.
- Right-sizing instance families based on memory-to-CPU ratios observed in profiling data.
- Managing cold start risks in serverless environments by provisioning concurrency limits and pre-warming strategies.
- Tracking dependency saturation points, such as database connection pools, during scaling events.
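The queue-backlog bullet above can be sketched as a simple linear projection of time-to-limit; the assumption of a roughly constant arrival/consumption imbalance is a deliberate simplification, and the queue depths and limit shown are illustrative.

```python
# Backlog trend sketch: estimate minutes until a broker queue hits its
# depth limit, assuming roughly linear backlog growth between samples.

def minutes_until_full(depth_samples: list[int], limit: int,
                       interval_minutes: float = 1.0):
    """depth_samples: queue depths at fixed intervals, oldest first.
    Returns None when the backlog is stable or draining."""
    elapsed = (len(depth_samples) - 1) * interval_minutes
    growth_per_minute = (depth_samples[-1] - depth_samples[0]) / elapsed
    if growth_per_minute <= 0:
        return None
    return (limit - depth_samples[-1]) / growth_per_minute

# Growing 50 msgs/min with 750 msgs of headroom: ~15 minutes to act.
print(minutes_until_full([100, 150, 200, 250], limit=1000))  # 15.0
```

An alert on projected time-to-limit fires earlier than one on absolute depth, which is what buys time to scale consumers before the broker starts shedding load.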
Module 8: Governance and Compliance in Availability Management
- Aligning availability controls with regulatory requirements (e.g., PCI-DSS, HIPAA) for data access during outages.
- Documenting availability architecture decisions in system risk assessment reports for audit purposes.
- Enforcing retention policies for incident logs and post-mortem records based on compliance mandates.
- Classifying systems by business impact to prioritize availability investments and recovery order.
- Implementing access controls for failover execution tools to meet segregation of duties requirements.
- Reporting availability metrics to executive stakeholders using standardized templates that exclude sensitive details.
- Updating business continuity plans to reflect changes in cloud provider dependencies and third-party services.
- Conducting third-party assessments of vendor SLAs to validate claims about infrastructure redundancy.
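The business-impact classification above can be sketched as a tier table driving recovery order; the tier names, RTO targets, and example systems are illustrative assumptions rather than a standard taxonomy.

```python
# Business-impact tiers driving DR recovery order, a sketch.
# Tier names, RTO targets, and systems are illustrative.

TIER_MAX_RTO_MINUTES = {"tier1": 15, "tier2": 60, "tier3": 240}

systems = [
    {"name": "payments-api", "tier": "tier1"},
    {"name": "reporting-batch", "tier": "tier3"},
    {"name": "search", "tier": "tier2"},
]

def recovery_order(systems: list[dict]) -> list[dict]:
    """Recover the tightest-RTO systems first."""
    return sorted(systems, key=lambda s: TIER_MAX_RTO_MINUTES[s["tier"]])

print([s["name"] for s in recovery_order(systems)])
# ['payments-api', 'search', 'reporting-batch']
```

Keeping the tier table in version control gives auditors a single artifact that explains both the investment prioritization and the restoration sequence.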
Module 9: Post-Incident Analysis and System Hardening
- Conducting blameless post-mortems with mandatory participation from all involved engineering teams.
- Tracking action item completion from incident reviews using integrated project management tools.
- Prioritizing remediation tasks based on recurrence likelihood and potential impact on availability.
- Implementing automated tests that reproduce root cause conditions to prevent regression.
- Updating monitoring dashboards to include signals that would have detected the incident earlier.
- Introducing chaos engineering experiments targeting identified failure modes in staging environments.
- Revising on-call playbooks with new diagnostic steps and escalation paths based on incident findings.
- Measuring mean time to recovery (MTTR) improvements after implementing system hardening changes.
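The MTTR measurement above can be sketched as a before/after comparison of incident durations; the durations are illustrative, and the median is reported alongside the mean because a single long outage can dominate the mean and mask a real improvement.

```python
from statistics import median

# MTTR before/after a hardening change, a sketch with illustrative data.
# Durations are incident recovery times in minutes.

before = [42, 35, 120, 30, 55]
after = [20, 25, 18, 90, 22]

def mttr(durations: list[float]) -> float:
    """Mean time to recovery across a set of incidents."""
    return sum(durations) / len(durations)

print(f"mean:   {mttr(before):.1f} -> {mttr(after):.1f} min")
print(f"median: {median(before):.1f} -> {median(after):.1f} min")
```

With so few incidents per quarter, a shift in the median (42 to 22 minutes here) is usually more trustworthy evidence of improvement than the mean, which the one 90-minute outlier keeps inflated.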