This curriculum covers the full lifecycle of availability management: technical design, cross-team coordination, and governance practices used in large-scale service operations. Its scope is comparable to a multi-workshop operational resilience program.
Module 1: Defining and Measuring System Availability
- Selecting appropriate availability metrics (e.g., uptime percentage, mean time between failures (MTBF), mean time to repair (MTTR)) based on service criticality and business SLAs
- Implementing synthetic transaction monitoring to simulate user workflows and detect degradation before real users are impacted
- Configuring time windows for scheduled maintenance without violating availability commitments in global operations
- Integrating business transaction data with availability metrics to correlate technical uptime with actual service usability
- Establishing thresholds for degraded performance that trigger availability alerts, even when systems remain technically "up"
- Designing data collection intervals that balance monitoring granularity with storage and processing overhead
- Validating monitoring tool accuracy by cross-referencing logs, network probes, and application health endpoints
- Documenting assumptions in availability calculations, such as failover success rates and dependency behavior
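
As a minimal sketch of the metrics named in this module, the following computes uptime percentage, MTBF, and MTTR from a list of outage intervals. The `Outage` type, the field names, and the sample figures are illustrative assumptions. MTBF is taken here as total uptime divided by the number of failures, which is one common convention; a real calculation should document which convention and which exclusions (such as scheduled maintenance) it uses.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_min: float  # minutes since the start of the measurement window
    end_min: float


def availability_metrics(window_min, outages):
    """Return uptime percentage, MTBF, and MTTR for one measurement window."""
    downtime = sum(o.end_min - o.start_min for o in outages)
    uptime = window_min - downtime
    failures = len(outages)
    return {
        "availability_pct": 100.0 * uptime / window_min,
        # Assumed convention: MTBF = total uptime / number of failures.
        "mtbf_min": uptime / failures if failures else None,
        # MTTR = total downtime / number of failures.
        "mttr_min": downtime / failures if failures else None,
    }


# Hypothetical 30-day window with two outages (30 and 13 minutes).
metrics = availability_metrics(
    window_min=30 * 24 * 60,
    outages=[Outage(1_000, 1_030), Outage(20_000, 20_013)],
)
print(metrics)
```

Here 43 minutes of downtime in a 43,200-minute window yields an availability just above 99.9%, which shows how little downtime a "three nines" commitment leaves in a month.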
Module 2: Availability Requirements and SLA Negotiation
- Translating business continuity objectives into technical availability targets for individual components and end-to-end services
- Negotiating SLA terms with legal and procurement teams, including exclusion clauses for third-party dependencies and force majeure
- Mapping service dependencies to quantify cascading failure risks and allocate availability budgets across subsystems
- Defining measurement methodologies in SLAs to prevent disputes over data sources and calculation logic
- Setting differentiated availability targets for peak vs. off-peak business hours based on usage patterns
- Establishing escalation paths and remediation timelines for SLA breaches that align with business impact severity
- Documenting assumptions about client-side infrastructure when defining end-user availability commitments
- Revising SLAs in response to architectural changes such as cloud migration or third-party API integration
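
The idea of allocating an availability budget across subsystems can be illustrated with a short calculation. This sketch assumes every component is required for the service to work and that failures are independent, so the end-to-end availability is the product of the component availabilities. The component names and targets are hypothetical.

```python
import math


def serial_availability(component_targets):
    """End-to-end availability when every component is required.

    Assumes independent failures; correlated failures make this optimistic.
    """
    return math.prod(component_targets)


def allowed_downtime_minutes(target, window_days=30):
    """Downtime budget implied by an availability target over a window."""
    return (1.0 - target) * window_days * 24 * 60


# Hypothetical component targets for a three-tier service.
components = {"load_balancer": 0.9999, "app_tier": 0.9995, "database": 0.9995}
end_to_end = serial_availability(components.values())

print(f"end-to-end availability: {end_to_end:.5f}")
print(f"monthly budget at 99.9%: {allowed_downtime_minutes(0.999):.1f} min")
```

The composed figure is lower than any single component target, which is why an end-to-end SLA cannot simply restate the targets of its parts.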
Module 3: High Availability Architecture Design
- Selecting active-passive vs. active-active configurations based on data consistency requirements and recovery time objectives
- Distributing stateful components across failure domains while managing session persistence and data replication overhead
- Implementing health checks that accurately reflect service readiness, so that partially functional nodes are not reported as healthy
- Designing cross-region failover mechanisms that account for DNS TTLs, traffic routing policies, and data synchronization strategies
- Validating redundancy at all layers, including load balancers, databases, and configuration management systems
- Introducing circuit breakers and bulkheads to contain failures in microservices architectures
- Assessing cost-performance trade-offs of multi-cloud vs. single-cloud high availability strategies
- Planning for asymmetric capacity in failover sites to balance cost and acceptable performance degradation
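
The circuit breaker pattern mentioned in this module can be sketched as follows. This is a simplified illustration, not a production implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions made for the example, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Fail fast instead of adding load to a failing dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a downstream dependency, for example `breaker.call(fetch_inventory, item_id)`, where `fetch_inventory` is a hypothetical client function, and treat `CircuitOpenError` as a signal to use a degraded response.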
Module 4: Fault Tolerance and Resilience Engineering
- Implementing retry logic with exponential backoff and jitter to prevent thundering herd problems during transient outages
- Designing idempotent APIs to ensure safe retry of failed operations without unintended side effects
- Introducing chaos engineering practices, such as controlled failure injection, to validate system resilience
- Configuring watchdog timers and self-healing scripts to automatically restart or replace failed components
- Using canary deployments to test resilience changes on a subset of traffic before full rollout
- Hardening systems against cascading failures by rate-limiting downstream service calls during degradation
- Implementing graceful degradation modes that preserve core functionality when non-essential services are unavailable
- Validating backup systems under load to ensure they can sustain operations during extended primary system outages
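
Retry logic with exponential backoff and jitter, as listed above, might look like the sketch below. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. The function name and default parameters are illustrative assumptions, and the wrapped operation is assumed to be idempotent so that repeating it is safe.

```python
import random
import time


def retry(operation, max_attempts=5, base_s=0.5, cap_s=30.0,
          sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on any exception with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
            # Randomizing the delay spreads out clients that failed together.
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. In practice the `except` clause should be narrowed to errors known to be transient, since retrying a validation error only adds load.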
Module 5: Change and Configuration Management for Stability
- Enforcing change freeze windows during critical business periods and defining emergency change protocols
- Implementing immutable infrastructure patterns to reduce configuration drift and improve deployment consistency
- Using feature flags to decouple deployment from release, enabling rollback without code reversion
- Validating configuration changes in staging environments that mirror production topology and load
- Automating configuration drift detection and remediation using infrastructure-as-code tools
- Requiring peer review and approval workflows for changes to high-impact components
- Logging and auditing all configuration changes with user attribution and rollback capabilities
- Coordinating change schedules across interdependent teams to prevent unintended integration failures
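
Configuration drift detection reduces, at its core, to comparing a declared state with an observed one. The sketch below assumes both are available as flat dictionaries; the keys and values are hypothetical, and real infrastructure-as-code tools perform this comparison against provider APIs with far richer semantics.

```python
MISSING = "<missing>"


def detect_drift(desired, actual):
    """Return {key: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical declared configuration versus what is observed on a host.
desired = {"max_connections": 500, "tls_min_version": "1.2", "log_level": "info"}
actual = {"max_connections": 500, "tls_min_version": "1.0", "debug_port": 9229}

for key, (want, have) in detect_drift(desired, actual).items():
    print(f"{key}: desired={want!r} actual={have!r}")
```

Reporting keys that exist only on the observed side matters as much as reporting changed values, since unmanaged settings such as an open debug port are a typical result of manual changes.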
Module 6: Monitoring, Alerting, and Incident Response
- Designing alerting rules that minimize false positives while ensuring critical failures are detected promptly
- Implementing alert deduplication and correlation to prevent incident overload during systemic outages
- Establishing on-call rotation schedules with escalation policies and fatigue management rules
- Integrating monitoring systems with incident management platforms to automate ticket creation and status updates
- Defining runbooks with step-by-step recovery procedures for common failure scenarios
- Conducting post-mortems with blameless analysis to identify systemic issues and prevent recurrence
- Using real-time dashboards to provide situational awareness during active incidents
- Validating alert delivery paths across multiple channels (SMS, email, voice) to ensure reachability
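
Alert deduplication can be sketched as suppressing repeats of the same alert within a time window. In this illustration an alert is identified by its service and check name, and the five-minute window is an arbitrary assumption. Correlating different alerts that share a root cause is a harder problem that this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp_s: float


def deduplicate(alerts, window_s=300.0):
    """Keep an alert only if the same (service, check) was not kept recently."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp_s):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp_s - previous >= window_s:
            kept.append(alert)
            last_kept[key] = alert.timestamp_s
    return kept


# Hypothetical burst: one check fires repeatedly while another fires once.
alerts = [
    Alert("checkout", "latency", 0),
    Alert("checkout", "latency", 60),
    Alert("search", "errors", 30),
    Alert("checkout", "latency", 120),
    Alert("checkout", "latency", 400),
]
kept = deduplicate(alerts)
print([(a.service, a.timestamp_s) for a in kept])
```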
Module 7: Disaster Recovery and Business Continuity Planning
- Classifying systems by recovery time objective (RTO) and recovery point objective (RPO) to prioritize DR investment
- Designing data replication strategies that meet RPO requirements while managing bandwidth and storage costs
- Conducting regular disaster recovery drills with full failover and failback procedures
- Securing access to DR sites and ensuring credentials and decryption keys are available during outages
- Documenting manual workarounds for automated processes that may fail during disasters
- Coordinating DR testing with business units to validate operational continuity
- Updating DR plans following architectural changes, mergers, or regulatory updates
- Storing backup media offsite with environmental and access controls matching production standards
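
The classification and replication items above can be illustrated with a small check. The sketch ranks systems by recovery time and recovery point objectives and flags any system whose replication interval exceeds its recovery point objective. It approximates worst-case data loss by the replication interval alone, ignoring transfer lag, and all system names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    rto_min: int  # recovery time objective, in minutes
    rpo_min: int  # recovery point objective, in minutes
    replication_interval_min: int  # how often data is shipped to the DR site


def dr_priority(systems):
    """Order systems so the tightest recovery objectives come first."""
    return sorted(systems, key=lambda s: (s.rto_min, s.rpo_min))


def rpo_violations(systems):
    """Names of systems that could lose more data than their RPO allows."""
    return [s.name for s in systems if s.replication_interval_min > s.rpo_min]


systems = [
    System("reporting", rto_min=1440, rpo_min=1440, replication_interval_min=720),
    System("payments", rto_min=15, rpo_min=5, replication_interval_min=15),
    System("catalog", rto_min=240, rpo_min=60, replication_interval_min=30),
]

print([s.name for s in dr_priority(systems)])
print(rpo_violations(systems))
```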
Module 8: Dependency and Third-Party Risk Management
- Mapping upstream and downstream dependencies to identify single points of failure
- Assessing third-party SLAs and monitoring actual performance against contractual commitments
- Implementing fallback mechanisms for critical external APIs, such as cached responses or alternate providers
- Requiring contractual right-to-audit clauses for vendors supporting mission-critical services
- Monitoring DNS and certificate health for external dependencies to detect provider-level issues
- Limiting blast radius by sandboxing third-party integrations and enforcing strict network segmentation
- Conducting vendor business continuity assessments as part of procurement due diligence
- Designing abstraction layers to minimize integration coupling and simplify vendor replacement
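
A cached-response fallback for a critical external API could be sketched as a thin wrapper around the provider call. The class name, the staleness limit, and the returned `"live"`/`"stale"` labels are assumptions made for illustration. Whether serving stale data is acceptable depends on the data, so the limit should be set per use case.

```python
import time


class CachedFallback:
    """Serve the last good response when the provider call fails."""

    def __init__(self, fetch, max_stale_s=600.0, clock=time.monotonic):
        self.fetch = fetch  # callable that queries the external provider
        self.max_stale_s = max_stale_s
        self.clock = clock
        self._cache = {}  # key -> (value, time stored)

    def get(self, key):
        try:
            value = self.fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is not None and self.clock() - cached[1] <= self.max_stale_s:
                return cached[0], "stale"
            raise  # nothing usable in the cache: surface the failure
        self._cache[key] = (value, self.clock())
        return value, "live"
```

Returning the freshness label alongside the value lets callers decide whether to show a warning or disable actions that need current data. Because the provider is hidden behind the `fetch` callable, the same wrapper also serves as a small abstraction layer when a vendor is replaced.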
Module 9: Continuous Improvement and Availability Governance
- Establishing availability review boards to evaluate architectural changes and risk exposure
- Tracking availability trends across services to identify systemic weaknesses and prioritize remediation
- Conducting root cause analysis on near-misses and minor outages to prevent major failures
- Updating availability models based on post-incident findings and evolving business requirements
- Aligning availability investments with risk-based cost-benefit analysis, including downtime cost estimates
- Integrating availability KPIs into executive reporting and performance management frameworks
- Standardizing availability design patterns and configuration baselines across technology domains
- Revising governance policies in response to regulatory changes or audit findings
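
A risk-based cost-benefit comparison can start from a simple estimate of annual downtime cost. The sketch below assumes a constant cost per minute of downtime, which is a strong simplification since real costs vary by time of day and outage length. The dollar figures are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_cost(availability, cost_per_minute):
    """Expected yearly downtime cost at a given availability level."""
    return (1.0 - availability) * MINUTES_PER_YEAR * cost_per_minute


def net_benefit(current, target, cost_per_minute, annual_investment):
    """Avoided downtime cost minus the yearly cost of achieving the target."""
    avoided = (annual_downtime_cost(current, cost_per_minute)
               - annual_downtime_cost(target, cost_per_minute))
    return avoided - annual_investment


# Hypothetical case: moving from 99.9% to 99.99% at 1,000 per minute of
# downtime, with 300,000 per year of additional spend.
result = net_benefit(0.999, 0.9999, cost_per_minute=1_000, annual_investment=300_000)
print(f"net annual benefit: {result:,.0f}")
```

A positive result supports the investment under these assumptions, while a negative one suggests the current target already matches the business risk.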