This curriculum spans the design, execution, and governance of availability management practices across complex, multi-tiered systems, comparable in scope to a multi-workshop operational resilience program for large-scale cloud environments.
Module 1: Defining Service Availability Objectives
- Selecting SLA metrics (e.g., uptime percentage vs. request success rate) based on business-critical transaction paths
- Negotiating RTO and RPO thresholds with stakeholders for multi-tiered applications with interdependent components
- Mapping application dependencies to define scope boundaries for availability commitments
- Deciding whether to include scheduled maintenance in availability calculations
- Aligning availability targets across cloud provider SLAs and internal service agreements
- Documenting exclusions (e.g., force majeure, customer misconfigurations) to prevent disputes during incident reviews
- Establishing escalation paths when availability thresholds are breached
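The choice between uptime percentage and request success rate, and the decision about scheduled maintenance, can be made concrete with a little arithmetic. The following is a minimal sketch; the function names and the sample figures (20 minutes of unplanned downtime, 60 minutes of maintenance, a 30-day month) are illustrative assumptions, not values prescribed by this curriculum.

```python
def uptime_availability(total_minutes, unplanned_minutes, maintenance_minutes=0.0,
                        exclude_maintenance=True):
    """Return time-based availability as a percentage of the measured period."""
    if exclude_maintenance:
        # Scheduled maintenance is removed from the measured period entirely.
        measured = total_minutes - maintenance_minutes
        down = unplanned_minutes
    else:
        # Scheduled maintenance counts as downtime against the full period.
        measured = total_minutes
        down = unplanned_minutes + maintenance_minutes
    if measured <= 0:
        raise ValueError("measured period must be positive")
    return 100.0 * (measured - down) / measured


def request_success_rate(successful_requests, total_requests):
    """Return request-based availability as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * successful_requests / total_requests


def downtime_budget_minutes(target_percent, total_minutes):
    """Return the downtime that an availability target allows over the period."""
    return total_minutes * (1.0 - target_percent / 100.0)


MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month (illustrative period)
excluded = uptime_availability(MONTH, 20, 60, exclude_maintenance=True)
included = uptime_availability(MONTH, 20, 60, exclude_maintenance=False)
```

With these sample figures, excluding maintenance gives roughly 99.95% while including it gives roughly 99.81%, which is why the treatment of maintenance should be written into the agreement. A 99.9% target over the same month allows about 43.2 minutes of downtime.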
Module 2: Architecting for High Availability
- Choosing between active-passive and active-active deployment models based on data consistency requirements
- Implementing cross-AZ database replication with failover automation while managing replication lag risks
- Designing stateless application layers to enable horizontal scaling and seamless instance replacement
- Integrating health checks with load balancers to exclude unhealthy instances without manual intervention
- Selecting managed vs. self-hosted failover solutions based on operational overhead and control needs
- Validating redundancy at all layers (compute, storage, network, DNS) to eliminate single points of failure
- Configuring multi-region failover triggers based on synthetic monitoring results
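The health-check item above can be illustrated with a small in-memory model of how a load balancer might take an instance out of rotation and bring it back. This is a sketch under assumptions: `HealthCheckedPool`, its thresholds, and the backend names are hypothetical, and real load balancers expose this behavior through their own configuration rather than through code like this.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0


class HealthCheckedPool:
    """Tracks probe results and exposes only the backends considered healthy."""

    def __init__(self, names, unhealthy_threshold=3, healthy_threshold=2):
        self._backends = {name: Backend(name) for name in names}
        self._unhealthy_threshold = unhealthy_threshold
        self._healthy_threshold = healthy_threshold

    def record(self, name, ok):
        """Record one health-check result for a backend."""
        backend = self._backends[name]
        if ok:
            backend.consecutive_failures = 0
            backend.consecutive_successes += 1
            # Require several passes before returning a backend to rotation.
            if (not backend.healthy
                    and backend.consecutive_successes >= self._healthy_threshold):
                backend.healthy = True
        else:
            backend.consecutive_successes = 0
            backend.consecutive_failures += 1
            # Require several failures before removal, to tolerate one-off blips.
            if (backend.healthy
                    and backend.consecutive_failures >= self._unhealthy_threshold):
                backend.healthy = False

    def healthy_backends(self):
        """Return the names of backends currently eligible for traffic."""
        return sorted(b.name for b in self._backends.values() if b.healthy)
```

Using separate thresholds for removal and reinstatement is one way to avoid flapping when an instance is intermittently failing.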
Module 3: Monitoring and Failure Detection
- Calibrating alert thresholds to balance sensitivity with operational noise
- Deploying synthetic transactions to detect degradation before user impact
- Correlating infrastructure metrics with application-level errors to identify root causes faster
- Implementing heartbeat monitoring for background job processors and message queues
- Using canary checks to detect regional outages in cloud provider services
- Designing observability pipelines to ensure monitoring systems remain available during outages
- Integrating third-party status pages into internal dashboards for external dependency tracking
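Heartbeat monitoring for background workers, as listed above, comes down to recording when each worker last reported in and flagging any that have been silent for too long. The sketch below assumes a hypothetical `HeartbeatMonitor` class and an injectable clock so the behavior can be tested without waiting; a production setup would usually persist heartbeats outside the monitored system.

```python
import time


class HeartbeatMonitor:
    """Flags workers whose most recent heartbeat is older than a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def beat(self, worker_id):
        """Record a heartbeat from a worker at the current clock time."""
        self._last_seen[worker_id] = self._clock()

    def stale_workers(self):
        """Return worker IDs that have not reported within the timeout."""
        now = self._clock()
        return sorted(
            worker_id
            for worker_id, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A worker that has never sent a heartbeat is invisible to this sketch, so the expected set of workers would need to be registered separately to catch processes that fail to start.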
Module 4: Incident Response and Failover Execution
- Activating runbooks only after confirming failure scope and ruling out false positives
- Executing DNS failover with appropriate TTL settings to balance propagation speed and caching stability
- Validating data consistency before promoting a standby database to primary
- Coordinating failover timing across dependent services to prevent partial outages
- Documenting real-time decisions during failover for post-incident review
- Managing user communication during failover without disclosing system vulnerabilities
- Disabling automated scaling during failover to prevent race conditions
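One way to rule out false positives before activating a runbook is to require agreement from several independent probe locations. The following sketch assumes a hypothetical `failure_confirmed` helper and made-up location names; the quorum fraction is an illustrative default, not a recommended value.

```python
def failure_confirmed(probe_results, quorum=0.5):
    """Return True when more than `quorum` of probe locations report failure.

    probe_results maps a probe location to True if its check passed.
    """
    if not probe_results:
        # No data is treated as "not confirmed" rather than as a failure.
        return False
    failed = sum(1 for passed in probe_results.values() if not passed)
    return failed / len(probe_results) > quorum
```

A single failing location then points to a problem with that probe or its network path, while a majority of failing locations supports proceeding with failover.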
Module 5: Dependency and Third-Party Risk Management
- Assessing the availability posture of SaaS providers through audit reports and uptime history
- Implementing circuit breakers for external API dependencies to prevent cascading failures
- Negotiating contractual SLAs with third-party vendors that align with internal commitments
- Designing fallback modes (e.g., cached responses, offline functionality) for critical external dependencies
- Conducting regular failover drills involving third-party support teams
- Monitoring DNS and certificate health for externally hosted services
- Inventorying shadow IT services that introduce unmanaged availability risks
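The circuit breaker item above can be sketched as a small wrapper that stops calling a failing dependency for a cooling-off period. This is a simplified illustration, assuming hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and default thresholds; it is single-threaded and omits the metrics and per-error-type handling that a production library would provide.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a known-bad dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                # Open (or re-open) the circuit and restart the timeout.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers can catch `CircuitOpenError` and switch to a fallback mode, such as serving a cached response, which ties this pattern to the fallback item in the same module.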
Module 6: Data Resilience and Recovery
- Scheduling backups during low-traffic periods while ensuring RPO compliance
- Testing backup restoration procedures quarterly to validate recovery integrity
- Encrypting backups and managing key access to prevent recovery delays during incidents
- Storing backup copies in geographically isolated regions to survive regional disasters
- Implementing immutable backups to protect against ransomware or malicious deletion
- Validating transaction log replay processes for databases requiring point-in-time recovery
- Documenting data loss exposure during recovery windows for stakeholder awareness
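Checking a backup schedule against an RPO can be reduced to finding the largest gap between consecutive backups, including the gap between the latest backup and the present. The sketch below uses hypothetical function names and sample timestamps; it treats backup completion times as the recovery points and ignores backup duration.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Return the largest gap between recovery points, up to `now`."""
    if not backup_times:
        raise ValueError("at least one backup time is required")
    points = sorted(backup_times) + [now]
    # Compare each recovery point with the next one in time order.
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times, now, rpo):
    """Return True if no gap between recovery points exceeds the RPO."""
    return worst_case_data_loss(backup_times, now) <= rpo


# Sample schedule (illustrative): backups at 00:00, 06:00, and 14:00.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 14, 0),
]
checked_at = datetime(2024, 1, 1, 16, 0)
exposure = worst_case_data_loss(backups, checked_at)
```

In this sample the worst gap is eight hours (06:00 to 14:00), so the schedule would satisfy an 8-hour RPO but not a 6-hour one. The same figure is a reasonable starting point for documenting data loss exposure to stakeholders.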
Module 7: Change and Configuration Management
- Requiring peer review for configuration changes to production environments
- Scheduling maintenance windows during periods of lowest user activity
- Using feature flags to decouple deployment from release, reducing deployment risk
- Rolling back failed deployments using versioned infrastructure templates
- Enforcing immutable infrastructure to prevent configuration drift
- Conducting pre-change impact assessments for interdependent services
- Logging all configuration changes with audit trails for forensic analysis
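Decoupling deployment from release with feature flags often relies on a deterministic percentage rollout: the code ships to everyone, but only a stable subset of users sees the new behavior. The following is a minimal sketch, assuming hypothetical helper names (`bucket`, `is_enabled`) and a flag name invented for illustration; real flag systems add targeting rules, persistence, and kill switches.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """Return True if the user falls inside the flag's rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user ID, raising `rollout_percent` keeps every previously enabled user enabled, and setting it to 0 turns the feature off without a redeployment.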
Module 8: Post-Incident Analysis and Continuous Improvement
- Conducting blameless postmortems with participation from all involved teams
- Classifying incident root causes as technical, process, or communication failures
- Prioritizing remediation actions based on recurrence likelihood and impact severity
- Tracking remediation tasks in a shared dashboard visible to all involved teams to maintain accountability
- Updating runbooks and monitoring rules based on incident findings
- Revising availability targets when business requirements or technical constraints evolve
- Sharing anonymized incident summaries across teams to promote organizational learning
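Prioritizing remediation by recurrence likelihood and impact severity can be expressed as a simple scoring rule. The sketch below assumes a 1 to 5 scale for both factors and a multiplicative score, which is one common convention rather than a method this curriculum mandates; the `Remediation` class and sample tasks are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    title: str
    recurrence_likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact_severity: int        # assumed scale: 1 (minor) to 5 (critical)

    @property
    def risk_score(self):
        return self.recurrence_likelihood * self.impact_severity


def prioritize(actions):
    """Order by score, breaking ties in favor of higher impact, then title."""
    return sorted(actions, key=lambda a: (-a.risk_score, -a.impact_severity, a.title))


backlog = [
    Remediation("Add replica lag alert", 2, 5),
    Remediation("Automate certificate renewal", 4, 3),
    Remediation("Fix flaky deploy script", 5, 2),
]
ordered = [action.title for action in prioritize(backlog)]
```

Here the certificate task ranks first with a score of 12, and the two tasks tied at 10 are ordered by impact, so the replica lag alert comes before the deploy script.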
Module 9: Governance and Compliance Integration
- Aligning availability controls with regulatory requirements (e.g., HIPAA, GDPR, SOX)
- Documenting availability controls for external auditors and certification bodies
- Implementing access controls for failover procedures to prevent unauthorized execution
- Retaining incident records for legally mandated periods
- Conducting availability testing during compliance audit cycles
- Reporting availability metrics to executive leadership and board committees
- Updating business continuity plans to reflect changes in system architecture