This curriculum covers the design, execution, and governance of availability testing at the scale and rigor of an enterprise SRE program. It matches the technical depth and cross-team coordination required in multi-phase reliability initiatives across cloud infrastructure, CI/CD systems, and regulated environments.
Module 1: Defining Availability Requirements and SLIs
- Selecting appropriate service level indicators (SLIs) such as request success rate, latency thresholds, or system uptime based on business-critical transactions.
- Mapping user journeys to identify which components must be included in availability calculations for end-to-end reliability.
- Negotiating SLI precision with stakeholders when monitoring data sources have gaps or sampling limitations.
- Deciding between synthetic and real user monitoring (RUM) for SLI derivation based on traffic patterns and observability coverage.
- Establishing error budget policies that align with SLI targets and organizational risk tolerance.
- Handling SLI drift caused by architectural changes, such as microservice decommissioning or API versioning.
- Documenting SLI calculation methodologies to ensure consistency across teams and audit readiness.
- Integrating SLI definitions into CI/CD pipelines to enforce availability gates during deployments.
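As a rough illustration of how an SLI target, an error budget, and a deployment gate relate, the sketch below computes a request success rate and the share of error budget left in a window. The function names, the event counts, and the `min_budget_remaining` threshold are hypothetical and not part of the curriculum; a real gate would read these counts from the monitoring system.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; treated as 1.0 when there is no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Share of the window's error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def deployment_allowed(good_events: int, total_events: int, slo_target: float,
                       min_budget_remaining: float = 0.0) -> bool:
    """Hypothetical availability gate: block deploys once the budget is used up."""
    remaining = error_budget_remaining(good_events, total_events, slo_target)
    return remaining > min_budget_remaining
```

With a 99.9% target and 10,000 requests, 10 failures are allowed, so 5 failures leave roughly half the budget and the gate stays open, while 20 failures overspend it and the gate closes.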
Module 2: Designing Fault Injection and Chaos Engineering Programs
- Selecting production-like environments for chaos experiments while minimizing customer impact.
- Defining blast radius controls such as traffic percentage, region scope, and duration for fault injection tests.
- Coordinating with incident management teams to ensure proper alerting and rollback procedures are in place before running experiments.
- Choosing between open-source (e.g., Chaos Monkey) and commercial tools based on integration needs and support requirements.
- Developing automated rollback triggers based on SLI degradation during live experiments.
- Documenting experiment hypotheses and outcomes to refine system resilience assumptions.
- Obtaining change advisory board (CAB) approvals for production experiments under change management policies.
- Scaling chaos experiments across hybrid cloud and on-premises environments with consistent tooling.
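The blast radius controls and automated rollback trigger described in this module could be sketched as follows. This is a minimal, tool-agnostic outline: `BlastRadius`, `run_experiment`, and the `inject` and `rollback` callbacks are hypothetical names, and the SLI samples are assumed to arrive at a fixed polling interval.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class BlastRadius:
    traffic_percent: float      # share of traffic exposed to the fault
    regions: Tuple[str, ...]    # regions in scope
    max_duration_s: float       # hard stop for the experiment


def run_experiment(radius: BlastRadius,
                   sli_samples: Iterable[float],
                   abort_threshold: float,
                   inject: Callable[[BlastRadius], None],
                   rollback: Callable[[BlastRadius], None],
                   poll_interval_s: float = 1.0) -> str:
    """Inject a fault, watch the SLI, and always roll back at the end."""
    inject(radius)
    try:
        for index, sli in enumerate(sli_samples):
            if index * poll_interval_s >= radius.max_duration_s:
                return "completed"   # duration limit reached
            if sli < abort_threshold:
                return "aborted"     # SLI degraded past the agreed limit
        return "completed"
    finally:
        # Rollback runs on normal completion, abort, and unexpected errors.
        rollback(radius)
```

Placing the rollback in a `finally` block is one way to make sure the fault is removed even if the monitoring loop itself fails.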
Module 3: Implementing Automated Health Checks and Monitoring
- Designing layered health checks that differentiate between application liveness and service readiness.
- Configuring health check intervals and timeouts to avoid false positives during transient load spikes.
- Integrating health endpoints with load balancers and orchestration platforms (e.g., Kubernetes probes).
- Securing health endpoints to prevent information leakage while allowing external monitoring access.
- Correlating health check failures with distributed tracing data to identify root causes.
- Managing version skew in health check logic during rolling deployments.
- Using canary health checks to validate new service versions before full rollout.
- Automating health check validation in staging environments as part of deployment pipelines.
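To make the liveness and readiness distinction concrete, here is a framework-independent sketch. The function names and the response shape are assumptions for illustration; in practice these would sit behind HTTP handlers that a load balancer or Kubernetes probe calls.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """The process is up and able to answer; no dependencies are consulted."""
    return 200, {"status": "alive"}


def readiness(dependency_checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """The service can take traffic only if every dependency check passes."""
    failed = []
    for name, check in dependency_checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated as a failed dependency.
            healthy = False
        if not healthy:
            failed.append(name)
    if failed:
        # Only dependency names are returned, not error details, to limit leakage.
        return 503, {"status": "not_ready", "failed": failed}
    return 200, {"status": "ready"}
```

Keeping liveness free of dependency checks avoids restarting a healthy process because a downstream service is slow, while readiness removes the instance from rotation until its dependencies recover.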
Module 4: Conducting Site Reliability and Failover Testing
- Planning DNS and load balancer failover tests without disrupting active user traffic.
- Validating data replication lag between primary and secondary regions during simulated outages.
- Testing automated failover mechanisms in multi-cloud architectures with vendor-specific tooling.
- Measuring recovery time objectives (RTO) and recovery point objectives (RPO) under real failure conditions.
- Coordinating cross-team failover drills involving database, network, and application teams.
- Updating runbooks based on gaps identified during failover test execution.
- Handling stateful services (e.g., databases, message queues) during failover without data loss.
- Logging and auditing all failover test activities for compliance and post-mortem analysis.
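Measured RTO and RPO from a failover drill reduce to timestamp arithmetic, as the sketch below shows. The function names and the three timestamps (failure, service restored, last replicated write) are assumptions about what a drill log would record.

```python
from datetime import datetime, timedelta


def measure_rto(failure_at: datetime, service_restored_at: datetime) -> timedelta:
    """Observed recovery time: how long the service was unavailable."""
    return service_restored_at - failure_at


def measure_rpo(failure_at: datetime, last_replicated_write_at: datetime) -> timedelta:
    """Observed recovery point: the window of writes not yet on the secondary."""
    return failure_at - last_replicated_write_at


def meets_objectives(rto: timedelta, rpo: timedelta,
                     rto_target: timedelta, rpo_target: timedelta) -> bool:
    """True when both measured values are within their targets."""
    return rto <= rto_target and rpo <= rpo_target
```

For example, a failure at 12:00:00 with service restored at 12:04:30 and the last replicated write at 11:59:48 gives a measured RTO of 4 minutes 30 seconds and a measured RPO of 12 seconds.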
Module 5: Managing Dependencies and Third-Party Service Risks
- Mapping upstream dependencies to assess cascading failure risks during availability testing.
- Establishing fallback mechanisms (e.g., caching, circuit breakers) for critical third-party API dependencies.
- Testing system behavior when dependent services return degraded responses or timeouts.
- Negotiating SLAs with vendors and validating them through independent monitoring.
- Simulating third-party outages using proxy tools (e.g., Toxiproxy) in staging environments.
- Documenting dependency risk profiles and communicating them to business stakeholders.
- Updating integration points when third-party services deprecate APIs or change rate limits.
- Isolating dependency testing in sandboxed environments to prevent unintended production calls.
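One of the fallback mechanisms named above, a circuit breaker, might look like the following minimal sketch. The class, its parameters, and the injectable `clock` are hypothetical; production libraries add features such as per-error classification and concurrency safety that are left out here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In an availability test, `func` would wrap the third-party call and `fallback` would return a cached or default response, so the test can assert that users still get an answer while the dependency is down.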
Module 6: Integrating Availability Testing into CI/CD Pipelines
- Embedding synthetic transaction checks in deployment pipelines to validate service availability post-deploy.
- Setting pass/fail criteria for availability tests based on performance baselines and error budgets.
- Managing test data provisioning for availability checks in ephemeral environments.
- Configuring pipeline timeouts and retries to avoid spurious test failures while infrastructure is still provisioning.
- Linking availability test results to deployment tracking systems for audit trails.
- Running smoke tests on canary instances before routing production traffic.
- Securing pipeline access to production-like environments used for availability validation.
- Optimizing test execution duration to minimize pipeline bottlenecks without sacrificing coverage.
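A post-deploy availability gate with bounded retries could be sketched as below. The `probe` callable stands in for a synthetic transaction and is assumed to return `True` on success; the function names, retry counts, and the 95% threshold are illustrative values, not recommendations from this curriculum.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], max_retries: int = 5,
                     delay_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry the probe while the environment is still provisioning."""
    for attempt in range(max_retries):
        if probe():
            return True
        if attempt < max_retries - 1:
            sleep(delay_s)  # no sleep after the final attempt
    return False


def availability_gate(probe: Callable[[], bool], attempts: int = 20,
                      min_success_rate: float = 0.95) -> bool:
    """Pass only if enough synthetic transactions succeed."""
    successes = sum(1 for _ in range(attempts) if probe())
    return successes / attempts >= min_success_rate
```

Separating the readiness wait from the gate keeps provisioning delays from counting against the success rate, while the fixed number of attempts bounds how long the pipeline stage can run.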
Module 7: Analyzing and Acting on Test Results and Incidents
- Correlating availability test failures with production incident data to prioritize remediation.
- Classifying outages by cause (e.g., configuration error, capacity exhaustion, dependency failure) for trend analysis.
- Generating action items from test post-mortems with assigned owners and timelines.
- Updating monitoring dashboards and alerting rules based on test findings.
- Validating fix effectiveness by re-running failed availability tests in controlled environments.
- Archiving test results for compliance and long-term reliability trend analysis.
- Integrating test insights into capacity planning and architecture review processes.
- Using statistical process control to distinguish normal variance from systemic reliability issues.
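As a simplified illustration of statistical process control, the sketch below derives control limits from a baseline of error rates and flags observations that fall outside them. It uses the sample standard deviation and a three-sigma band, which is a simplification of a classic individuals chart (those usually estimate sigma from the moving range); the function names and data are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def control_limits(baseline: Sequence[float], sigmas: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits around the baseline mean."""
    center = mean(baseline)
    spread = stdev(baseline)  # requires at least two baseline points
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(baseline: Sequence[float], observations: Sequence[float],
                   sigmas: float = 3.0) -> List[int]:
    """Indices of observations outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [i for i, value in enumerate(observations) if value < lower or value > upper]
```

Points inside the limits are treated as normal variance; points outside are candidates for investigation as systemic reliability issues.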
Module 8: Governing Availability Testing at Scale
- Establishing centralized testing standards while allowing domain-specific adaptations.
- Allocating testing quotas to prevent resource contention in shared environments.
- Managing permissions and access controls for fault injection and production testing tools.
- Enforcing testing policies through infrastructure-as-code (IaC) guardrails and policy engines.
- Conducting periodic audits of test coverage across critical services.
- Reporting availability test metrics to executive stakeholders without oversimplifying technical context.
- Aligning testing schedules with change freeze periods and business cycles.
- Training SRE and DevOps teams on standardized testing procedures and tooling.
Module 9: Ensuring Compliance and Audit Readiness
- Documenting test procedures and results to meet regulatory requirements (e.g., SOC 2, HIPAA).
- Retaining test logs and system snapshots for forensic analysis during audits.
- Redacting sensitive data from test outputs before sharing with external auditors.
- Mapping availability tests to control objectives in internal compliance frameworks.
- Validating that disaster recovery tests meet industry-specific regulatory timelines.
- Coordinating with legal and risk teams to assess liability implications of planned failure tests.
- Updating business continuity plans based on test outcomes and gaps identified.
- Preparing evidence packages for auditors that demonstrate consistent testing execution over time.
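Redacting sensitive data from structured test output before it goes to an auditor might be sketched as follows. The key list, the mask string, and the function name are assumptions; the real set of sensitive fields would come from the organization's data classification policy, and free-text fields would need separate handling.

```python
from typing import Any, FrozenSet

# Hypothetical set of field names considered sensitive.
SENSITIVE_KEYS: FrozenSet[str] = frozenset(
    {"password", "token", "authorization", "email", "ssn"}
)


def redact(record: Any, sensitive_keys: FrozenSet[str] = SENSITIVE_KEYS,
           mask: str = "[REDACTED]") -> Any:
    """Return a copy of the record with values of sensitive keys masked."""
    if isinstance(record, dict):
        return {
            key: mask if str(key).lower() in sensitive_keys
            else redact(value, sensitive_keys, mask)
            for key, value in record.items()
        }
    if isinstance(record, list):
        return [redact(item, sensitive_keys, mask) for item in record]
    return record  # scalars pass through unchanged
```

Because the function returns a new structure, the unredacted original can be retained internally under the usual log retention controls.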