Description

This curriculum covers the design, implementation, and governance of high-availability systems at the level of technical rigor and cross-functional coordination expected in multi-phase SRE engagements and enterprise resilience programs.

Module 1: Defining Availability Requirements and SLIs

Selecting service-level indicators (SLIs) based on user-impacting metrics such as request success rate, latency percentiles, and task completion rate.
Negotiating SLI definitions with product and SRE teams to ensure observability instrumentation supports measurement.
Determining measurement windows (e.g., rolling 28-day) and error budget policies aligned with business cycles.
Mapping customer journeys to backend dependencies to identify critical paths for availability monitoring.
Setting thresholds for degraded vs. failed states in multi-tier services with cascading dependencies.
Documenting assumptions behind SLI calculations to enable auditability during incident reviews.
Implementing synthetic transactions to measure availability for non-request-driven components like batch processors.
Handling edge cases such as partial responses, retries, and idempotency in SLI computation logic.
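
As an illustration of the SLI and error budget topics above, the following is a minimal sketch of a request-based availability SLI. The names `WindowCounts`, `availability_sli`, and `error_budget_remaining` are hypothetical, and the sketch assumes that retries have already been collapsed into one logical request before counting.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts over one measurement window (e.g., rolling 28 days)."""
    good: int   # requests that met the success and latency criteria
    total: int  # all valid requests (assumption: retries counted once)


def availability_sli(counts: WindowCounts) -> float:
    """Fraction of valid requests that were good; 1.0 when there was no traffic."""
    if counts.total == 0:
        return 1.0
    return counts.good / counts.total


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Share of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo_target) * counts.total
    actual_bad = counts.total - counts.good
    if allowed_bad == 0:
        # No budget at all: either nothing failed or the budget is blown.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

For example, 500 bad requests out of 1,000,000 against a 99.9% target spends roughly half of the budget. How partial responses are classified (good or bad) is a policy choice that belongs with the documented SLI assumptions.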

Module 2: Architecting for Fault Tolerance and Redundancy

Designing stateless services with horizontal scaling to eliminate single points of failure at the instance level.
Implementing active-active replication across availability zones for stateful systems, including conflict resolution strategies.
Selecting consensus algorithms (e.g., Raft, Paxos) for distributed coordination based on quorum requirements and latency tolerance.
Configuring load balancer health checks to detect and route around unhealthy instances while minimizing false positives.
Deploying redundant control plane components (e.g., API gateways, service meshes) with independent failure domains.
Designing fallback mechanisms for external dependencies using circuit breakers and cached responses.
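Evaluating trade-offs between synchronous replication (stronger consistency, higher write latency) and asynchronous replication (lower write latency and higher write availability, with possible data loss on failover) for databases.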
Implementing anti-affinity rules in orchestration platforms to prevent co-location of replicas on shared infrastructure.
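
The fallback topic above (circuit breakers with cached responses) can be sketched as follows. This is a simplified, single-threaded illustration, not a production implementation. The `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency and serves the last good value instead."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.cached: Optional[str] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the timeout has elapsed, allow a trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_timeout_s

    def call(self, fetch: Callable[[], str]) -> Optional[str]:
        if self.is_open():
            return self.cached  # fail fast with the cached response
        try:
            result = fetch()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return self.cached
        # Success closes the circuit and refreshes the cache.
        self.failures = 0
        self.opened_at = None
        self.cached = result
        return result
```

Serving a cached value trades freshness for availability, so the acceptable staleness should be agreed per dependency. A `None` result means no fallback exists yet and the caller must degrade in some other way.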

Module 3: Deployment Strategies for Zero-Downtime Upgrades

Configuring blue-green deployments with traffic switching at the load balancer level and validation hooks.
Implementing canary rollouts with automated rollback triggers based on error rate and latency degradation.
Managing database schema changes using versioned migrations and backward-compatible data models.
Coordinating deployment windows across interdependent microservices to prevent interface mismatches.
Using feature flags to decouple deployment from release and enable runtime control of new functionality.
Validating deployment safety in pre-production staging environments that mirror production topology.
Instrumenting deployment pipelines with availability checks to prevent promotion during ongoing incidents.
Handling stateful workloads during rolling updates using ordered pod management and persistent volume retention.
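
An automated rollback trigger for a canary rollout, as described above, might compare canary metrics against the stable baseline. The sketch below is one possible shape. The `Metrics` type, the function name, and both default thresholds are hypothetical and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def should_roll_back(baseline: Metrics, canary: Metrics,
                     max_error_rate_delta: float = 0.01,
                     max_latency_ratio: float = 1.2) -> bool:
    """Return True when the canary is meaningfully worse than the baseline."""
    error_regressed = canary.error_rate - baseline.error_rate > max_error_rate_delta
    latency_regressed = canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
    return error_regressed or latency_regressed
```

Comparing against a baseline measured over the same time interval, rather than against a fixed threshold, helps avoid rolling back for problems that affect both versions equally.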

Module 4: Monitoring, Alerting, and Incident Detection

Defining alerting thresholds using SLO burn rate calculations to prioritize actionable incidents.
Reducing alert fatigue by suppressing non-actionable alerts during known maintenance windows.
Correlating metrics, logs, and traces to detect systemic issues before user impact escalates.
Implementing heartbeat monitoring for batch jobs and offline systems with expected execution windows.
Configuring escalation policies with on-call rotations and secondary responders for critical alerts.
Designing dashboard hierarchies that provide operational visibility from global health to component-level detail.
Validating monitoring coverage through periodic synthetic failure injection and alert response drills.
Integrating monitoring systems with incident management platforms for automatic ticket creation and status updates.
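
Burn rate, referenced in the first topic above, is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 spends exactly the whole budget over the SLO window. The sketch below pages only when both a long and a short window burn fast, which lets the alert clear soon after the problem stops. The function names and the default threshold of 14.4 are assumptions for illustration. Suitable thresholds depend on the SLO window and on how much budget a single alert is allowed to consume.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

With a 99.9% target, a 2% error ratio corresponds to a burn rate of about 20, so it would page under the default threshold if the short window confirms that the errors are still occurring.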

Module 5: Disaster Recovery and Cross-Region Resilience

Classifying workloads by recovery time objective (RTO) and recovery point objective (RPO) for tiered DR planning.
Implementing automated failover procedures for DNS and traffic routing during regional outages.
Replicating critical data across regions using managed services with versioned snapshots and integrity checks.
Conducting regular DR drills with documented runbooks and post-exercise validation of system state.
Managing failback procedures to prevent data loss or inconsistency after primary region restoration.
Designing multi-region service meshes with locality-based routing to minimize cross-region latency.
Securing access to DR environments with isolated credentials and audit logging to prevent accidental activation.
Documenting dependencies on region-locked services (e.g., AI accelerators, specific APIs) in DR plans.
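
The RTO/RPO classification in the first topic above can be made explicit in code so that tier assignments are reviewable. The following sketch uses hypothetical tier boundaries and strategy labels. Real boundaries come from business impact analysis, not from this example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Workload:
    name: str
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


def dr_tier(w: Workload) -> str:
    """Map a workload's objectives to a DR strategy tier (illustrative cutoffs)."""
    if w.rto <= timedelta(minutes=15) and w.rpo <= timedelta(minutes=1):
        return "tier-1: active-active or hot standby"
    if w.rto <= timedelta(hours=4) and w.rpo <= timedelta(hours=1):
        return "tier-2: warm standby"
    return "tier-3: backup and restore"


def meets_rpo(w: Workload, replication_lag: timedelta) -> bool:
    """True when the current cross-region replication lag is within the RPO."""
    return replication_lag <= w.rpo
```

A check such as `meets_rpo` could run continuously against measured replication lag, so that an RPO violation is detected before a regional outage rather than during one.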

Module 6: Change Management and Risk Mitigation

Requiring change advisory board (CAB) review for high-risk changes impacting core availability components.
Enforcing pre-change checks such as backup validation, monitoring readiness, and rollback plan documentation.
Implementing time-based change freezes during peak business periods with emergency override protocols.
Using automated linting and policy engines to block non-compliant infrastructure-as-code changes.
Tracking change velocity and incident correlation to identify teams requiring additional operational maturity support.
Integrating post-implementation reviews into the change lifecycle to update risk profiles and controls.
Requiring canary analysis for all production deployments, even during maintenance windows.
Logging all configuration changes with user attribution and audit trail retention for compliance.
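
A time-based change freeze with an emergency override, as listed above, can be enforced as a pipeline gate. The sketch below is a minimal example under stated assumptions: `FreezeWindow`, `change_allowed`, and the `emergency_approver` parameter are hypothetical names, and a real gate would verify the approver's identity rather than accept a plain string.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class FreezeWindow:
    start: datetime  # inclusive
    end: datetime    # exclusive
    reason: str


def change_allowed(at: datetime, freezes: List[FreezeWindow],
                   emergency_approver: Optional[str] = None) -> Tuple[bool, str]:
    """Decide whether a change may proceed, returning the decision and an audit message."""
    for window in freezes:
        if window.start <= at < window.end:
            if emergency_approver:
                return True, f"emergency override by {emergency_approver} during {window.reason}"
            return False, f"blocked by freeze: {window.reason}"
    return True, "no active freeze"
```

Returning a message alongside the decision gives the audit trail a record of why each change was allowed or blocked, including who approved an override.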

Module 7: Capacity Planning and Scalability Engineering

Forecasting resource demands using historical growth trends, seasonality, and upcoming product launches.
Conducting load testing with production-like traffic patterns to validate autoscaling policies.
Right-sizing compute instances based on utilization telemetry and cost-performance trade-offs.
Implementing predictive autoscaling using machine learning models trained on usage patterns.
Managing cold start risks for serverless functions by configuring provisioned concurrency.
Planning for burst capacity in multi-tenant environments to prevent noisy neighbor degradation.
Monitoring queue depths and backpressure signals in message-driven architectures to detect saturation.
Designing sharding strategies for databases and caches to distribute load and avoid hot keys.
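
For the sharding topic above, one simple approach is hash-based shard selection combined with key salting for known hot keys. The sketch below is illustrative only. `shard_for` and `salted_keys` are hypothetical helpers, and the `#` separator is an arbitrary choice. It uses `hashlib` rather than Python's built-in `hash()`, because string hashes from `hash()` vary between processes by default.

```python
import hashlib
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def salted_keys(hot_key: str, fanout: int) -> List[str]:
    """Split one hot key into several sub-keys that can land on different shards."""
    return [f"{hot_key}#{i}" for i in range(fanout)]
```

Writers pick one sub-key (for example at random) and readers aggregate across all of them, which spreads load at the cost of more expensive reads. Modulo-based placement also remaps most keys when `num_shards` changes. Consistent hashing is a common way to reduce that movement.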

Module 8: Incident Response and Postmortem Culture

Activating incident command structure with defined roles (incident commander, comms lead, tech lead).
Using status pages to communicate outage timelines and mitigation progress to internal and external stakeholders.
Preserving system state (logs, metrics, core dumps) at the time of incident for root cause analysis.
Conducting blameless postmortems with participation from all involved teams and leadership.
Tracking action items from postmortems in a centralized system with ownership and due dates.
Classifying incidents by severity and recurrence to prioritize investment in systemic fixes.
Integrating incident timelines with monitoring data to reconstruct sequences accurately.
Standardizing postmortem templates to ensure consistent documentation of impact, timeline, and contributing factors.

Module 9: Governance, Compliance, and Audit Readiness

Mapping availability controls to regulatory and compliance framework requirements (e.g., HIPAA, SOC 2, GDPR) for audit evidence.
Documenting RTO/RPO commitments in service contracts and ensuring technical alignment.
Implementing role-based access control (RBAC) for production systems with just-in-time privilege elevation.
Conducting periodic control assessments to verify availability mechanisms remain effective.
Archiving incident records, change logs, and postmortems for statutory retention periods.
Aligning availability metrics with financial risk models for business continuity planning.
Requiring third-party vendors to provide uptime reports and incident histories under their SLAs.
Coordinating with legal and compliance teams on disclosure policies during extended outages.