This curriculum spans the breadth of availability management work typically covered in multi-workshop reliability engineering programs: the technical, procedural, and organisational practices behind mature incident response, resilience design, and compliance-aligned operations.
Module 1: Defining Service Availability Objectives and SLIs
- Selecting appropriate service-level indicators (SLIs), such as request success rate or latency-threshold attainment, based on business-critical workflows, and deriving error budgets from the corresponding SLO targets.
- Negotiating SLI definitions with product and operations teams when user-perceived availability diverges from backend health metrics.
- Implementing synthetic monitoring probes to simulate user journeys and measure availability independently of client-side reporting.
- Deciding whether to exclude maintenance windows from SLI calculations and documenting those exclusions in SLA contracts.
- Calibrating SLI measurement intervals (e.g., 5-minute vs. 1-minute) to balance responsiveness with noise reduction.
- Handling edge cases in SLI computation, such as partial failures in distributed transactions across microservices.
- Integrating third-party dependency health into internal SLIs when external APIs directly impact service functionality.
- Establishing thresholds for degraded service that trigger alerts before SLO breaches occur.
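The SLI and error-budget arithmetic underlying this module can be sketched in a few lines. A minimal illustration in Python, with hypothetical function names and a 99.9% SLO chosen purely as an example:

```python
def sli_success_rate(good_events: int, total_events: int) -> float:
    """Availability SLI as the fraction of successful requests in a window."""
    if total_events == 0:
        return 1.0  # no traffic: treat the window as fully available
    return good_events / total_events

def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent for this window.

    slo_target is e.g. 0.999; the budget is (1 - slo_target) of all events.
    """
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return max(0.0, 1.0 - actual_failures / allowed_failures)

# Example: a 99.9% SLO over 1,000,000 requests with 400 failures
remaining = error_budget_remaining(0.999, 999_600, 1_000_000)
print(f"{remaining:.0%}")  # prints "60%" -- 60% of the budget remains
```

The same window-based computation is where maintenance-window exclusions and measurement-interval choices from the bullets above take effect: they change which events are counted as `good` and `total`.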
Module 2: Designing Resilient System Architectures
- Choosing between active-passive and active-active deployment models based on recovery time objective (RTO) and recovery point objective (RPO) requirements for critical services.
- Implementing circuit breakers in inter-service communication to prevent cascading failures during downstream outages.
- Designing retry mechanisms with exponential backoff and jitter to avoid thundering herd problems after service recovery.
- Deciding on data replication strategies (synchronous vs. asynchronous) across regions and their impact on consistency and availability.
- Introducing bulkheads to isolate failure domains in shared infrastructure such as message queues or databases.
- Configuring health checks for load balancers to exclude instances during startup or under high load.
- Evaluating the trade-offs of stateless vs. stateful services in failover scenarios and session recovery.
- Implementing feature flags to disable non-critical functionality during partial outages without full service rollback.
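The retry pattern named above (exponential backoff with jitter) can be sketched briefly; this is an illustrative "full jitter" variant, not a prescription, and the function name is hypothetical:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `operation` with exponential backoff and full jitter.

    Full jitter (each delay drawn uniformly from [0, cap]) desynchronizes
    clients so a recovering service is not hit by a thundering herd.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the last failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

In practice the caught exception type should be narrowed to transient failures only, so that permanent errors (e.g. authorization failures) fail fast instead of consuming the retry budget.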
Module 3: Incident Detection and Alerting
- Configuring alerting rules to minimize false positives while ensuring timely detection of availability degradation.
- Setting dynamic thresholds for anomaly detection using historical traffic patterns instead of static values.
- Routing alerts to on-call engineers based on service ownership and escalation policies during overlapping incidents.
- Suppressing alerts during planned maintenance without disabling monitoring coverage.
- Correlating alerts across related services to identify root cause instead of reacting to symptom-level notifications.
- Integrating observability tools with incident management platforms to auto-create incident tickets with enriched context.
- Managing alert fatigue by tuning sensitivity and defining clear ownership for each alert type.
- Validating alert delivery paths through periodic test incidents and communication channel audits.
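One simple form of the dynamic thresholding described above is a mean-plus-k-sigma rule over historical samples; a minimal sketch, with hypothetical function names and no seasonality handling:

```python
import statistics

def dynamic_threshold(history: list[float], k: float = 3.0) -> float:
    """Alert threshold derived from history: mean + k standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

def is_anomalous(value: float, history: list[float], k: float = 3.0) -> bool:
    """Flag a sample that exceeds the historically derived threshold."""
    return value > dynamic_threshold(history, k)
```

Real traffic usually needs per-hour or per-weekday baselines rather than one global history, but the principle is the same: the threshold moves with observed behaviour instead of being a hand-tuned static value.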
Module 4: Incident Response and Triage
- Declaring incident severity levels based on user impact, revenue loss, or regulatory exposure.
- Activating incident command roles (e.g., incident commander, comms lead) during major outages to coordinate response.
- Executing predefined runbooks for common failure scenarios while adapting to novel conditions.
- Blocking non-essential deployments and configuration changes during active incidents to reduce variables.
- Initiating rollback procedures when mitigation attempts worsen service degradation.
- Sharing real-time status updates with internal stakeholders without speculating on root cause.
- Preserving logs, metrics, and traces from the time of incident for postmortem analysis.
- Coordinating with legal and PR teams when outages involve data exposure or regulatory reporting obligations.
Module 5: Root Cause Analysis and Post-Incident Review
- Conducting blameless postmortems that focus on systemic factors rather than individual actions.
- Identifying contributing factors beyond the immediate technical failure, such as process gaps or training deficiencies.
- Documenting timelines with precise timestamps from multiple data sources to reconstruct incident progression.
- Classifying outages by type (e.g., deployment, configuration, dependency, capacity) to prioritize remediation efforts.
- Assigning measurable action items with owners and deadlines to address identified vulnerabilities.
- Integrating postmortem findings into architecture review checklists for new service designs.
- Archiving incident reports in a searchable knowledge base accessible to engineering teams.
- Reviewing recurring incident patterns quarterly to assess effectiveness of prior remediations.
Module 6: Change Management and Deployment Safety
- Requiring canary analysis before full rollout to detect availability regressions in production.
- Enforcing deployment freezes during peak business periods or major events.
- Implementing automated rollback triggers based on SLO violations during new version deployment.
- Requiring peer review of configuration changes that affect load balancers, DNS, or routing rules.
- Tracking change velocity and correlating deployment frequency with incident rates.
- Using feature flags with gradual rollouts to limit blast radius of faulty code paths.
- Validating rollback procedures regularly through controlled rollback drills.
- Integrating deployment pipelines with incident management systems to flag high-risk changes.
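An automated rollback trigger of the kind listed above can be reduced to a comparison of canary and baseline error rates; a minimal sketch, assuming a hypothetical `should_rollback` decision function and illustrative default thresholds:

```python
def should_rollback(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Trip a rollback when the canary's error rate exceeds the baseline's
    by more than `max_ratio`, once enough canary traffic has been observed."""
    if canary_total < min_requests:
        return False  # not enough signal yet; keep observing
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    # Floor guards against a zero-error baseline making any canary error fatal
    floor = 0.001
    return canary_rate > max(baseline_rate * max_ratio, floor)
```

Production canary analysis typically also compares latency distributions and applies statistical significance tests, but even this crude ratio check catches gross availability regressions before full rollout.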
Module 7: Dependency and Third-Party Risk Management
- Mapping indirect dependencies (e.g., transitive libraries, shared infrastructure) that could introduce single points of failure.
- Negotiating SLAs with third-party vendors and verifying compliance through independent monitoring.
- Implementing local caching or fallback responses when external APIs become unresponsive.
- Conducting failover drills for critical SaaS providers with no active-active options.
- Assessing vendor lock-in risks that could delay recovery during provider-specific outages.
- Requiring contractual provisions for post-incident reporting from third-party providers.
- Monitoring DNS and certificate health for externally hosted dependencies to detect upstream issues early.
- Developing contingency plans for vendor bankruptcy or service deprecation.
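The local-caching fallback for unresponsive external APIs, mentioned above, can be sketched as a small wrapper that serves the last known good response when the upstream call fails; the class name and TTL default are illustrative assumptions:

```python
import time

class CachedFallback:
    """Serve the last known good response when the upstream call fails.

    `fetch` is any callable returning fresh data; `ttl_seconds` bounds how
    stale a cached value may be before the fallback itself gives up.
    """
    def __init__(self, fetch, ttl_seconds: float = 300.0):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self._value = None
        self._stored_at = 0.0

    def get(self):
        try:
            self._value = self.fetch()
            self._stored_at = time.monotonic()
            return self._value
        except Exception:
            age = time.monotonic() - self._stored_at
            if self._value is not None and age <= self.ttl:
                return self._value  # degrade gracefully with stale data
            raise  # nothing usable cached; propagate the upstream failure
```

Whether stale data is acceptable, and for how long, is a product decision per dependency; the TTL makes that decision explicit rather than implicit.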
Module 8: Capacity Planning and Scalability Engineering
- Forecasting traffic growth based on historical trends and business initiatives to plan infrastructure scaling.
- Conducting load testing under realistic conditions to identify bottlenecks before peak demand periods.
- Setting autoscaling policies that respond to both traffic volume and service-level health indicators.
- Reserving failover capacity in secondary regions without incurring idle cost penalties.
- Managing stateful service scaling challenges, such as distributed locking or data rebalancing.
- Implementing rate limiting and queuing mechanisms to protect backend systems during traffic surges.
- Right-sizing compute instances based on actual utilization metrics rather than peak theoretical loads.
- Planning for sudden traffic spikes due to viral content or external referrals with preemptive scaling rules.
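The rate-limiting mechanism referenced above is often implemented as a token bucket, which admits short bursts while capping sustained throughput; a minimal single-threaded sketch (a production limiter would add locking and usually run at the edge or in a shared store):

```python
import time

class TokenBucket:
    """Token-bucket limiter: admits bursts up to `capacity`,
    refills at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should shed or queue this request
```

Pairing the limiter with a bounded queue turns hard rejections into brief waits during surges, at the cost of added latency; unbounded queues merely move the overload problem downstream.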
Module 9: Governance, Compliance, and Audit Readiness
- Aligning availability practices with regulatory requirements such as GDPR, HIPAA, or PCI DSS.
- Documenting business continuity and disaster recovery procedures for external audits.
- Generating availability reports for executive review and board-level risk assessments.
- Implementing access controls and audit logs for configuration changes affecting service availability.
- Retaining incident records and monitoring data for legally mandated timeframes.
- Conducting annual failover tests to validate recovery plans and meet compliance obligations.
- Mapping service dependencies for critical systems to support regulatory impact analyses.
- Coordinating with internal audit teams to verify controls around change management and incident response.