This curriculum covers the design, operation, and governance of highly available systems, applying the technical rigor and cross-functional coordination found in reliability programs at large-scale technology organizations.
Module 1: Defining and Measuring System Availability
- Select availability targets (e.g., 99.9% vs. 99.99%) based on business impact analysis and cost of downtime per hour.
- Implement synthetic monitoring to simulate user transactions and measure uptime independently of real-user traffic fluctuations.
- Decide whether to exclude scheduled maintenance windows from SLA calculations and document the policy in service contracts.
- Integrate monitoring data from multiple sources (on-prem, cloud, SaaS) into a unified availability dashboard with consistent time alignment.
- Establish thresholds for partial degradation (e.g., elevated API latency or error rates) and define when degradation counts as an outage.
- Configure time-zone-aware blackout periods for regional maintenance without affecting global availability reporting.
- Validate third-party provider uptime claims by cross-referencing internal telemetry with vendor-reported SLA data.
- Design data retention policies for availability metrics to support long-term trend analysis and audit requirements.
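A minimal sketch of the measurement policy above: computing an availability percentage over a reporting period while excluding scheduled maintenance from both the counted downtime and the denominator. Function and parameter names are illustrative; maintenance windows are assumed to be non-overlapping and contained within the period.

```python
from datetime import datetime, timedelta

def availability_pct(period_start, period_end, outages, maintenance_windows):
    """Availability over a period, with scheduled maintenance excluded.

    `outages` and `maintenance_windows` are lists of (start, end) datetime
    pairs. Outage time that overlaps a maintenance window is not counted
    as downtime, and maintenance time is removed from the denominator.
    """
    total = (period_end - period_start).total_seconds()
    maint = sum((e - s).total_seconds() for s, e in maintenance_windows)
    downtime = 0.0
    for o_start, o_end in outages:
        dt = (o_end - o_start).total_seconds()
        for m_start, m_end in maintenance_windows:
            # Subtract any portion of the outage inside a maintenance window.
            overlap = (min(o_end, m_end) - max(o_start, m_start)).total_seconds()
            dt -= max(0.0, overlap)
        downtime += max(0.0, dt)
    return 100.0 * (1.0 - downtime / (total - maint))
```

Whichever exclusion policy is chosen, the same formula should back both the SLA report and the contract language, so the two cannot drift apart.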
Module 2: Architecting for High Availability
- Choose between active-passive and active-active architectures based on RTO/RPO requirements and data consistency needs.
- Implement load balancer health checks that probe downstream application endpoints, so application-layer failures are detected rather than just network reachability.
- Distribute stateful services across availability zones using replication strategies that balance consistency and latency.
- Configure DNS failover mechanisms with appropriate TTL settings to minimize propagation delay during outages.
- Design retry logic with exponential backoff and jitter to prevent thundering herd problems during transient failures.
- Integrate circuit breakers into service-to-service communication to isolate failing components and preserve system stability.
- Select storage replication modes (synchronous vs. asynchronous) based on distance between regions and acceptable data loss.
- Validate failover automation through controlled chaos engineering experiments without impacting production users.
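Retry with exponential backoff and full jitter, as described above, can be sketched as follows; the helper name and defaults are illustrative rather than drawn from any particular library.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `fn` on exception with exponential backoff and full jitter.

    Full jitter (a delay drawn uniformly from [0, cap]) spreads retries
    out so that many clients recovering from the same transient failure
    do not stampede the backend at once.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

Pairing this with a circuit breaker matters: retries handle brief blips, while the breaker stops retry traffic from hammering a dependency that is down for longer.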
Module 3: Incident Management and Outage Response
- Define escalation paths for availability incidents based on severity, business impact, and time since detection.
- Implement incident bridges with standardized roles (incident commander, comms lead, tech lead) and documented runbooks.
- Configure real-time alerting that suppresses noise by correlating related signals (e.g., latency spikes and error rates).
- Establish post-mortem processes that require root cause analysis, timeline reconstruction, and action item tracking.
- Use status page APIs to automatically update external stakeholders during ongoing incidents.
- Integrate incident timelines with monitoring tools to reconstruct sequences of events from logs, metrics, and traces.
- Enforce communication protocols for internal updates during outages to prevent information silos across teams.
- Conduct blameless retrospectives with mandatory participation from all involved engineering and operations teams.
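Severity- and time-based escalation, as in the first bullet above, reduces to a small lookup. The ladder below is a hypothetical policy, not a recommendation; real paging tools encode the equivalent declaratively.

```python
from datetime import timedelta

# Hypothetical policy: each rung is (time-since-detection threshold, role).
ESCALATION_LADDER = {
    "sev1": [(timedelta(0), "on-call engineer"),
             (timedelta(minutes=15), "engineering manager"),
             (timedelta(minutes=30), "incident commander")],
    "sev2": [(timedelta(0), "on-call engineer"),
             (timedelta(hours=1), "engineering manager")],
}

def escalation_targets(severity, elapsed):
    """Everyone who should be engaged, given severity and elapsed time."""
    return [role for threshold, role in ESCALATION_LADDER[severity]
            if elapsed >= threshold]
```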
Module 4: Change and Deployment Risk Management
- Implement canary deployments with automated rollback triggers based on error rate and latency thresholds.
- Enforce deployment freezes during high-risk business periods (e.g., end-of-quarter, Black Friday).
- Require change advisory board (CAB) review for infrastructure modifications affecting core availability components.
- Integrate deployment pipelines with configuration management databases (CMDB) to track service dependencies.
- Use feature flags to decouple deployment from release, enabling gradual exposure and immediate disablement.
- Measure deployment failure rates per service and use them to prioritize reliability improvements.
- Implement dark launch capabilities to route production traffic to new systems without user exposure.
- Log all configuration changes in version control and enforce peer review before production application.
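An automated canary rollback trigger comparing the canary against the baseline fleet might look like the sketch below. The thresholds and metric field names are assumptions; production triggers typically evaluate over a sliding window with statistical tests rather than point comparisons.

```python
def should_rollback(canary, baseline,
                    max_error_ratio=2.0, max_latency_ratio=1.5):
    """Decide whether to roll a canary back, relative to the baseline.

    `canary` and `baseline` are dicts with 'error_rate' (fraction of
    failed requests) and 'p99_latency_ms'. Ratios catch regressions even
    when absolute rates are low; the absolute guard catches the case
    where the baseline error rate is effectively zero.
    """
    if (baseline["error_rate"] > 0
            and canary["error_rate"] / baseline["error_rate"] > max_error_ratio):
        return True
    if canary["error_rate"] > 0.05:
        return True
    if canary["p99_latency_ms"] / baseline["p99_latency_ms"] > max_latency_ratio:
        return True
    return False
```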
Module 5: Dependency and Third-Party Risk
- Map upstream and downstream dependencies for critical services using automated service discovery tools.
- Assess third-party API reliability through historical uptime data and contractual SLA enforceability.
- Implement fallback mechanisms (e.g., cached responses, default values) for non-critical external dependencies.
- Negotiate right-to-audit clauses for vendors whose failures could trigger regulatory or financial penalties.
- Monitor DNS provider health independently and prepare for DNS resolution failures with local caching strategies.
- Conduct quarterly business continuity drills that simulate failure of key SaaS providers (e.g., identity, email).
- Enforce rate limiting and quotas on internal services to prevent cascading failures from dependency overload.
- Classify dependencies by criticality and apply differentiated monitoring and alerting policies accordingly.
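A cached-response fallback for a non-critical external dependency can be sketched as a decorator; the name `with_cached_fallback` is hypothetical, and the TTL bounds how stale a served fallback may be.

```python
import functools
import time

def with_cached_fallback(ttl_seconds=300):
    """On dependency failure, serve the last successful response.

    Suitable for non-critical external calls where a slightly stale
    answer is better than an error. Cache entries older than the TTL
    are not served; the original exception propagates instead.
    """
    def decorator(fn):
        cache = {}  # args -> (timestamp, value)

        @functools.wraps(fn)
        def wrapper(*args):
            try:
                value = fn(*args)
                cache[args] = (time.monotonic(), value)
                return value
            except Exception:
                entry = cache.get(args)
                if entry and time.monotonic() - entry[0] <= ttl_seconds:
                    return entry[1]  # stale but within the TTL
                raise
        return wrapper
    return decorator
```

The same decorator shape also works for default values: replace the cache lookup with a constant, and the dependency degrades to a fixed answer instead of an error.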
Module 6: Capacity Planning and Scalability
- Forecast traffic growth using historical trends and business roadmap inputs to plan infrastructure scaling cycles.
- Set auto-scaling policies on leading indicators (e.g., queue depth, request rate) rather than purely reactive thresholds such as CPU utilization.
- Conduct load testing under realistic conditions, including peak concurrency and mixed transaction types.
- Identify and eliminate single-point scaling bottlenecks (e.g., database connection pools, licensing limits).
- Implement horizontal partitioning (sharding) for databases when vertical scaling reaches economic or technical limits.
- Monitor resource utilization trends to detect "noisy neighbor" effects in shared environments.
- Size cloud instances based on sustained performance benchmarks, not peak burst capabilities.
- Establish capacity review meetings with product and infrastructure teams to align on growth assumptions.
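Scaling on queue depth, a leading indicator per the bullets above, can be sketched as sizing the fleet to drain the backlog within a target time; all names and defaults here are illustrative.

```python
import math

def desired_replicas(queue_depth, drain_rate_per_replica,
                     target_drain_seconds=60,
                     min_replicas=2, max_replicas=50):
    """Size a worker fleet so the queue drains within a target time.

    `drain_rate_per_replica` is items processed per second by one
    replica. Clamping to [min, max] keeps a floor for redundancy and a
    ceiling against runaway scale-out.
    """
    needed = math.ceil(
        queue_depth / (drain_rate_per_replica * target_drain_seconds))
    return max(min_replicas, min(max_replicas, needed))
```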
Module 7: Monitoring and Observability Strategy
- Define SLOs with measurable SLIs (e.g., request success rate, latency percentiles) for each critical service.
- Instrument applications with structured logging to enable automated parsing and correlation during incidents.
- Deploy distributed tracing across microservices to identify latency bottlenecks in request flows.
- Set alert thresholds using error budgets to balance sensitivity with operational overhead.
- Consolidate monitoring tools to reduce tool sprawl while ensuring coverage across infrastructure, application, and business layers.
- Implement log retention and sampling strategies that comply with regulatory requirements and cost constraints.
- Use synthetic transactions to validate end-to-end workflows that are rarely triggered by real users.
- Configure anomaly detection on key metrics with manual review processes to prevent alert fatigue.
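Error-budget-based alerting is often expressed as burn rate: the observed error rate divided by the budget the SLO allows. The fast-burn page threshold of 14.4 below is a commonly cited example (it exhausts a 30-day budget in roughly two days); the exact values are assumptions to tune per service.

```python
def burn_rate(error_rate, slo_target):
    """How fast the error budget is being consumed.

    A 99.9% SLO leaves a 0.1% error budget; burn rate 1.0 spends it
    exactly over the SLO window, while higher rates exhaust it sooner.
    """
    budget = 1.0 - slo_target
    return error_rate / budget

def should_page(error_rate, slo_target, fast_burn_threshold=14.4):
    """Page only on fast burn; slower burns go to a ticket queue."""
    return burn_rate(error_rate, slo_target) >= fast_burn_threshold
```

Multi-window variants (e.g., requiring both a 1-hour and a 5-minute window to exceed the threshold) further reduce flapping, at the cost of slightly slower detection.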
Module 8: Governance and Compliance Integration
- Align availability controls with regulatory frameworks (e.g., SOC 2, ISO 27001) requiring documented resilience measures.
- Document and test disaster recovery plans annually to meet audit requirements and insurance conditions.
- Classify systems by criticality to apply differentiated availability controls and reporting obligations.
- Implement access controls for production changes that enforce separation of duties and dual approval.
- Retain incident records and post-mortem reports for audit trail completeness and legal defensibility.
- Report availability metrics to executive leadership and board committees on a quarterly basis.
- Validate backup integrity through periodic restore tests and document results for compliance evidence.
- Coordinate with legal and risk teams to assess liability exposure from SLA breaches in customer contracts.
Module 9: Continuous Reliability Improvement
- Track reliability KPIs (e.g., MTTR, MTBF, change failure rate) across teams to identify improvement opportunities.
- Conduct fault injection tests in production with controlled blast radius and real-time rollback capability.
- Integrate reliability requirements into the software development lifecycle via architecture review gates.
- Run GameDay exercises with cross-functional teams to validate incident response under realistic conditions.
- Use error budget policies to govern feature release velocity and prevent reliability erosion.
- Benchmark reliability practices against industry peers to identify gaps in tooling, process, or staffing.
- Establish reliability champions within product teams to drive ownership beyond centralized SRE functions.
- Review and update runbooks quarterly to reflect changes in architecture, dependencies, and personnel.
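Two of the KPIs named above reduce to simple arithmetic over incident and deployment records; the sketch below assumes incidents are recorded as (detected_at, resolved_at) datetime pairs.

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to restore, from (detected_at, resolved_at) pairs."""
    durations = [(resolved - detected).total_seconds() / 60
                 for detected, resolved in incidents]
    return sum(durations) / len(durations)

def change_failure_rate(deployments, failed_deployments):
    """Fraction of deployments that required remediation."""
    return failed_deployments / deployments
```

Because MTTR is a mean, it is worth reporting alongside a percentile: one long-running incident can dominate the average and mask steady improvement elsewhere.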