This curriculum spans the breadth of availability management work typically covered in multi-workshop reliability engineering programs: the technical, procedural, and organisational practices behind mature incident response, resilience design, and compliance-aligned operations.
Module 1: Defining Service Availability Objectives and SLIs
- Selecting appropriate service-level indicators (SLIs), such as request success rate or latency-threshold attainment, based on business-critical workflows, and deriving error budgets from the corresponding SLO targets.
- Negotiating SLI definitions with product and operations teams when user-perceived availability diverges from backend health metrics.
- Implementing synthetic monitoring probes to simulate user journeys and measure availability independently of client-side reporting.
- Deciding whether to exclude maintenance windows from SLI calculations and documenting those exclusions in SLA contracts.
- Calibrating SLI measurement intervals (e.g., 5-minute vs. 1-minute) to balance responsiveness with noise reduction.
- Handling edge cases in SLI computation, such as partial failures in distributed transactions across microservices.
- Integrating third-party dependency health into internal SLIs when external APIs directly impact service functionality.
- Establishing thresholds for degraded service that trigger alerts before SLO breaches occur.
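The SLI and error-budget arithmetic underlying this module can be sketched in a few lines. A minimal illustration in Python, with hypothetical function names and a 99.9% SLO chosen purely as an example:

```python
def sli_success_rate(good_events: int, total_events: int) -> float:
    """Availability SLI as the fraction of successful requests in a window."""
    if total_events == 0:
        return 1.0  # no traffic: treat the window as fully available
    return good_events / total_events

def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent for this window.

    slo_target is e.g. 0.999; the budget is (1 - slo_target) of all events.
    """
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return max(0.0, 1.0 - actual_failures / allowed_failures)

# Example: a 99.9% SLO over 1,000,000 requests with 400 failures
remaining = error_budget_remaining(0.999, 999_600, 1_000_000)
print(f"{remaining:.0%}")  # prints "60%" -- 60% of the budget remains
```

The same window-based computation is where maintenance-window exclusions and measurement-interval choices from the bullets above take effect: they change which events are counted as `good` and `total`.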
Module 2: Designing Resilient System Architectures
- Choosing between active-passive and active-active deployment models based on recovery time objective (RTO) and recovery point objective (RPO) requirements for critical services.
- Implementing circuit breakers in inter-service communication to prevent cascading failures during downstream outages.
- Designing retry mechanisms with exponential backoff and jitter to avoid thundering herd problems after service recovery.
- Deciding on data replication strategies (synchronous vs. asynchronous) across regions and their impact on consistency and availability.
- Introducing bulkheads to isolate failure domains in shared infrastructure such as message queues or databases.
- Configuring health checks for load balancers to exclude instances during startup or under high load.
- Evaluating the trade-offs of stateless vs. stateful services in failover scenarios and session recovery.
- Implementing feature flags to disable non-critical functionality during partial outages without full service rollback.
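The retry pattern named above (exponential backoff with jitter) can be sketched briefly; this is an illustrative "full jitter" variant, not a prescription, and the function name is hypothetical:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `operation` with exponential backoff and full jitter.

    Full jitter (each delay drawn uniformly from [0, cap]) desynchronizes
    clients so a recovering service is not hit by a thundering herd.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the last failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

In practice the caught exception type should be narrowed to transient failures only, so that permanent errors (e.g. authorization failures) fail fast instead of consuming the retry budget.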
Module 3: Incident Detection and Alerting
- Configuring alerting rules to minimize false positives while ensuring timely detection of availability degradation.
- Setting dynamic thresholds for anomaly detection using historical traffic patterns instead of static values.
- Routing alerts to on-call engineers based on service ownership and escalation policies during overlapping incidents.
- Suppressing alerts during planned maintenance without disabling monitoring coverage.
- Correlating alerts across related services to identify root cause instead of reacting to symptom-level notifications.
- Integrating observability tools with incident management platforms to auto-create incident tickets with enriched context.
- Managing alert fatigue by tuning sensitivity and defining clear ownership for each alert type.
- Validating alert delivery paths through periodic test incidents and communication channel audits.
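One simple form of the dynamic thresholding described above is a mean-plus-k-sigma rule over historical samples; a minimal sketch, with hypothetical function names and no seasonality handling:

```python
import statistics

def dynamic_threshold(history: list[float], k: float = 3.0) -> float:
    """Alert threshold derived from history: mean + k standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

def is_anomalous(value: float, history: list[float], k: float = 3.0) -> bool:
    """Flag a sample that exceeds the historically derived threshold."""
    return value > dynamic_threshold(history, k)
```

Real traffic usually needs per-hour or per-weekday baselines rather than one global history, but the principle is the same: the threshold moves with observed behaviour instead of being a hand-tuned static value.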
Module 4: Incident Response and Triage
- Declaring incident severity levels based on user impact, revenue loss, or regulatory exposure.
- Activating incident command roles (e.g., incident commander, comms lead) during major outages to coordinate response.
- Executing predefined runbooks for common failure scenarios while adapting to novel conditions.
- Blocking non-essential deployments and configuration changes during active incidents to reduce variables.
- Initiating rollback procedures when mitigation attempts worsen service degradation.
- Sharing real-time status updates with internal stakeholders without speculating on root cause.
- Preserving logs, metrics, and traces from the time of incident for postmortem analysis.
- Coordinating with legal and PR teams when outages involve data exposure or regulatory reporting obligations.
Module 5: Root Cause Analysis and Post-Incident Review
- Conducting blameless postmortems that focus on systemic factors rather than individual actions.
- Identifying contributing factors beyond the immediate technical failure, such as process gaps or training deficiencies.
- Documenting timelines with precise timestamps from multiple data sources to reconstruct incident progression.
- Classifying outages by type (e.g., deployment, configuration, dependency, capacity) to prioritize remediation efforts.
- Assigning measurable action items with owners and deadlines to address identified vulnerabilities.
- Integrating postmortem findings into architecture review checklists for new service designs.
- Archiving incident reports in a searchable knowledge base accessible to engineering teams.
- Reviewing recurring incident patterns quarterly to assess effectiveness of prior remediations.
Module 6: Change Management and Deployment Safety
- Requiring canary analysis before full rollout to detect availability regressions in production.
- Enforcing deployment freezes during peak business periods or major events.
- Implementing automated rollback triggers based on SLO violations during new version deployment.
- Requiring peer review of configuration changes that affect load balancers, DNS, or routing rules.
- Tracking change velocity and correlating deployment frequency with incident rates.
- Using feature flags with gradual rollouts to limit blast radius of faulty code paths.
- Validating rollback procedures regularly through controlled rollback drills.
- Integrating deployment pipelines with incident management systems to flag high-risk changes.
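An automated rollback trigger of the kind listed above can be reduced to a comparison of canary and baseline error rates; a minimal sketch, assuming a hypothetical `should_rollback` decision function and illustrative default thresholds:

```python
def should_rollback(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Trip a rollback when the canary's error rate exceeds the baseline's
    by more than `max_ratio`, once enough canary traffic has been observed."""
    if canary_total < min_requests:
        return False  # not enough signal yet; keep observing
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    # Floor guards against a zero-error baseline making any canary error fatal
    floor = 0.001
    return canary_rate > max(baseline_rate * max_ratio, floor)
```

Production canary analysis typically also compares latency distributions and applies statistical significance tests, but even this crude ratio check catches gross availability regressions before full rollout.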
Module 7: Dependency and Third-Party Risk Management
- Mapping indirect dependencies (e.g., transitive libraries, shared infrastructure) that could introduce single points of failure.
- Negotiating SLAs with third-party vendors and verifying compliance through independent monitoring.
- Implementing local caching or fallback responses when external APIs become unresponsive.
- Conducting failover drills for critical SaaS providers with no active-active options.
- Assessing vendor lock-in risks that could delay recovery during provider-specific outages.
- Requiring contractual provisions for post-incident reporting from third-party providers.
- Monitoring DNS and certificate health for externally hosted dependencies to detect upstream issues early.
- Developing contingency plans for vendor bankruptcy or service deprecation.
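The local-caching fallback for unresponsive external APIs, mentioned above, can be sketched as a small wrapper that serves the last known good response when the upstream call fails; the class name and TTL default are illustrative assumptions:

```python
import time

class CachedFallback:
    """Serve the last known good response when the upstream call fails.

    `fetch` is any callable returning fresh data; `ttl_seconds` bounds how
    stale a cached value may be before the fallback itself gives up.
    """
    def __init__(self, fetch, ttl_seconds: float = 300.0):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self._value = None
        self._stored_at = 0.0

    def get(self):
        try:
            self._value = self.fetch()
            self._stored_at = time.monotonic()
            return self._value
        except Exception:
            age = time.monotonic() - self._stored_at
            if self._value is not None and age <= self.ttl:
                return self._value  # degrade gracefully with stale data
            raise  # nothing usable cached; propagate the upstream failure
```

Whether stale data is acceptable, and for how long, is a product decision per dependency; the TTL makes that decision explicit rather than implicit.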
Module 8: Capacity Planning and Scalability Engineering
- Forecasting traffic growth based on historical trends and business initiatives to plan infrastructure scaling.
- Conducting load testing under realistic conditions to identify bottlenecks before peak demand periods.
- Setting autoscaling policies that respond to both traffic volume and service-level health indicators.
- Reserving failover capacity in secondary regions without incurring idle cost penalties.
- Managing stateful service scaling challenges, such as distributed locking or data rebalancing.
- Implementing rate limiting and queuing mechanisms to protect backend systems during traffic surges.
- Right-sizing compute instances based on actual utilization metrics rather than peak theoretical loads.
- Planning for sudden traffic spikes due to viral content or external referrals with preemptive scaling rules.
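The rate-limiting mechanism referenced above is often implemented as a token bucket, which admits short bursts while capping sustained throughput; a minimal single-threaded sketch (a production limiter would add locking and usually run at the edge or in a shared store):

```python
import time

class TokenBucket:
    """Token-bucket limiter: admits bursts up to `capacity`,
    refills at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should shed or queue this request
```

Pairing the limiter with a bounded queue turns hard rejections into brief waits during surges, at the cost of added latency; unbounded queues merely move the overload problem downstream.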
Module 9: Governance, Compliance, and Audit Readiness
- Aligning availability practices with regulatory requirements such as GDPR, HIPAA, or PCI DSS.
- Documenting business continuity and disaster recovery procedures for external audits.
- Generating availability reports for executive review and board-level risk assessments.
- Implementing access controls and audit logs for configuration changes affecting service availability.
- Retaining incident records and monitoring data for legally mandated timeframes.
- Conducting annual failover tests to validate recovery plans and meet compliance obligations.
- Mapping service dependencies for critical systems to support regulatory impact analyses.
- Coordinating with internal audit teams to verify controls around change management and incident response.