Description

This curriculum covers the full lifecycle of service availability management. Its scope is comparable to a multi-phase internal capability program that integrates architecture, operations, governance, and continuous improvement practices across distributed engineering teams.

Module 1: Defining and Measuring Service Availability

Selecting appropriate availability metrics (e.g., uptime percentage, MTBF, MTTR) based on business criticality and service type
Establishing service-specific SLAs with measurable thresholds that align with business objectives and technical feasibility
Implementing synthetic transaction monitoring to detect service degradation before users are affected
Integrating real user monitoring (RUM) data with synthetic metrics to validate actual user experience
Calibrating measurement windows (e.g., rolling 28-day vs. calendar month) to avoid misleading availability reporting
Handling edge cases such as partial outages, regional failures, and degraded functionality in availability calculations
Documenting and versioning availability definitions to ensure consistency across teams and audits
Aligning availability measurement with incident management timelines to avoid double-counting or gaps
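
The metrics in this module can be illustrated with a small calculation. The sketch below is one possible approach, not a prescribed method: it assumes a rolling 28-day window and weights each outage by the fraction of users affected, so that a partial outage counts for less than a full one. The `Outage` type, the function names, and the sample outages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float  # duration of the incident
    impact: float   # assumed fraction of users affected, 0.0 to 1.0


def availability_pct(window_minutes, outages):
    """Availability over the window, weighting each outage by its impact."""
    weighted_downtime = sum(o.minutes * o.impact for o in outages)
    return 100.0 * (1.0 - weighted_downtime / window_minutes)


def mttr_minutes(outages):
    """Mean time to repair: average outage duration."""
    return sum(o.minutes for o in outages) / len(outages)


def mtbf_minutes(window_minutes, outages):
    """Mean time between failures: total uptime divided by failure count."""
    uptime = window_minutes - sum(o.minutes for o in outages)
    return uptime / len(outages)


WINDOW = 28 * 24 * 60  # rolling 28-day window, in minutes

# Illustrative data: one full outage and one partial (25% of users) outage.
outages = [Outage(minutes=30, impact=1.0), Outage(minutes=60, impact=0.25)]

print(f"availability: {availability_pct(WINDOW, outages):.4f}%")
print(f"MTTR: {mttr_minutes(outages)} min, MTBF: {mtbf_minutes(WINDOW, outages)} min")
```

Whether to weight by impact at all is a definitional choice. It is one of the decisions that belongs in the documented, versioned availability definition.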

Module 2: High Availability Architecture Design

Selecting active-active vs. active-passive deployment models based on RTO, RPO, and cost constraints
Designing stateless services to enable seamless failover and horizontal scaling
Implementing data replication strategies (synchronous vs. asynchronous) across availability zones
Architecting cross-region failover mechanisms with automated traffic routing (e.g., DNS failover, GSLB)
Validating failover procedures through controlled chaos engineering experiments
Designing retry logic with exponential backoff and circuit breakers to prevent cascading failures
Ensuring session persistence mechanisms do not become single points of failure
Integrating health checks at multiple layers (network, application, data) to inform routing decisions
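
The retry and circuit-breaker item above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: it is single-threaded, uses full-jitter exponential backoff, and opens the circuit after a fixed number of consecutive failures. The class names, function names, and default values are hypothetical.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset period elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(count, base_s=0.1, cap_s=5.0, rng=random.random):
    """Full jitter: delay i is drawn uniformly from [0, min(cap_s, base_s * 2**i))."""
    return [rng() * min(cap_s, base_s * 2 ** i) for i in range(count)]


def call_with_retry(fn, breaker, attempts=4, sleep=time.sleep):
    delays = backoff_delays(attempts - 1)
    for i in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not retry against an open circuit
        except Exception:
            if i == attempts - 1:
                raise
            sleep(delays[i])
```

Backoff spaces out retries from one caller. The breaker stops all retries once the dependency is clearly unhealthy, which is what limits cascading load on a failing service.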

Module 3: Incident Response and Outage Management

Defining escalation paths and on-call rotations with clear ownership for availability-critical services
Implementing incident war room protocols with real-time communication and documentation standards
Using incident timelines to reconstruct outage sequences and identify root causes
Integrating monitoring alerts with incident management platforms to reduce mean time to acknowledge (MTTA)
Conducting blameless postmortems with structured templates to capture technical and process failures
Enforcing a 48-hour postmortem draft deadline to maintain accuracy and momentum
Tracking action items from postmortems in a centralized system with ownership and due dates
Classifying incidents by severity and business impact to prioritize remediation efforts
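
As a rough illustration of how incident timelines feed response metrics, the sketch below computes mean time to acknowledge (MTTA) and mean time to resolve from timestamped records. The record layout, field names, and the two sample incidents are hypothetical. A real incident management platform will expose its own schema.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start_iso, end_iso):
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end_iso) - datetime.fromisoformat(start_iso)
    return delta.total_seconds() / 60


# Illustrative records only; not taken from any real system.
incidents = [
    {"id": "INC-1", "alerted": "2024-03-01T10:00:00",
     "acknowledged": "2024-03-01T10:04:00", "resolved": "2024-03-01T10:49:00"},
    {"id": "INC-2", "alerted": "2024-03-09T22:30:00",
     "acknowledged": "2024-03-09T22:38:00", "resolved": "2024-03-09T23:05:00"},
]

mtta = mean(minutes_between(i["alerted"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes_between(i["alerted"], i["resolved"]) for i in incidents)

print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```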

Module 4: Change and Deployment Risk Management

Requiring availability impact assessments for all changes to production environments
Implementing canary deployments with automated rollback triggers based on health metrics
Enforcing deployment freeze windows during peak business periods
Using feature flags to decouple deployment from release and enable rapid disablement
Requiring peer review of rollback procedures before high-risk changes
Logging all deployment activities in a centralized audit trail with immutable timestamps
Integrating deployment pipelines with monitoring systems to detect regressions immediately
Requiring pre-deployment validation of backup and recovery procedures for critical services
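
An automated rollback trigger for a canary deployment can be as simple as comparing the canary's error rate with the baseline's. The sketch below shows one possible rule. The function name and the thresholds (`max_ratio`, `min_requests`, `abs_floor`) are assumptions chosen for illustration and would need tuning per service.

```python
def should_rollback(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=2.0, min_requests=500, abs_floor=0.001):
    """Return True when the canary's error rate is clearly worse than baseline.

    max_ratio:    tolerated multiple of the baseline error rate
    min_requests: minimum canary traffic before any judgement is made
    abs_floor:    error rate below which the canary is never rolled back
    """
    if canary_requests < min_requests:
        return False  # not enough traffic to judge yet
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    return canary_rate > max(abs_floor, baseline_rate * max_ratio)


# Baseline: 0.05% errors. Canary: 1.2% errors over 1,000 requests.
print(should_rollback(50, 100_000, 12, 1_000))
```

The absolute floor prevents rollbacks when both error rates are negligible. The minimum-traffic guard prevents a decision based on a handful of requests.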

Module 5: Disaster Recovery Planning and Testing

Conducting business impact analysis (BIA) to define RTO and RPO for each critical system
Designing geographically isolated backup sites with independent power, network, and staffing
Establishing data backup schedules and retention policies aligned with recovery objectives
Scheduling regular disaster recovery tests with defined success criteria and participation requirements
Simulating partial and complete data center outages to validate failover and failback procedures
Measuring actual RTO and RPO during tests and adjusting architecture or processes accordingly
Documenting recovery runbooks with step-by-step instructions and contact information
Coordinating DR tests with external vendors and third-party service providers
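
Comparing measured recovery times against their objectives after each test can be automated. The sketch below is a minimal example. The `DrTestResult` type, the system names, and the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrTestResult:
    system: str
    rto_target_min: float  # recovery time objective from the BIA
    rto_actual_min: float  # time to restore service during the test
    rpo_target_min: float  # recovery point objective from the BIA
    rpo_actual_min: float  # age of the most recent recoverable data


def objective_gaps(result):
    """List the objectives that the measured values missed."""
    findings = []
    if result.rto_actual_min > result.rto_target_min:
        over = result.rto_actual_min - result.rto_target_min
        findings.append(f"{result.system}: RTO missed by {over:g} min")
    if result.rpo_actual_min > result.rpo_target_min:
        over = result.rpo_actual_min - result.rpo_target_min
        findings.append(f"{result.system}: RPO missed by {over:g} min")
    return findings


# Illustrative test results only.
results = [
    DrTestResult("orders-db", rto_target_min=60, rto_actual_min=85,
                 rpo_target_min=15, rpo_actual_min=5),
    DrTestResult("auth", rto_target_min=30, rto_actual_min=20,
                 rpo_target_min=5, rpo_actual_min=5),
]

for r in results:
    for finding in objective_gaps(r):
        print(finding)
```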

Module 6: Monitoring and Alerting Strategy

Defining service-level objectives (SLOs) and error budgets to guide alerting thresholds
Implementing multi-dimensional alerting on the four golden signals (latency, traffic, errors, saturation)
Reducing alert fatigue by suppressing non-actionable alerts and routing alerts to appropriate teams
Using dynamic thresholds based on historical patterns to reduce false positives
Integrating synthetic and real user monitoring data into a unified observability dashboard
Validating alert effectiveness through periodic alert reviews and noise audits
Ensuring monitoring systems themselves are highly available and independently monitored
Standardizing metric naming and tagging conventions across teams for consistency
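
The relationship between an SLO, its error budget, and an alerting threshold can be shown with a short calculation. The sketch below assumes a request-based SLO of 99.9% and a paging rule that requires both a long and a short window to exceed a burn-rate threshold. The function names, the sample counts, and the default threshold of 14.4 are illustrative assumptions, not values taken from this curriculum.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo, window_requests, window_failures):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return (window_failures / window_requests) / (1.0 - slo)


def should_page(long_window_burn, short_window_burn, threshold=14.4):
    """Page only when both windows agree, to filter out brief spikes."""
    return long_window_burn >= threshold and short_window_burn >= threshold


SLO = 0.999

remaining = error_budget_remaining(SLO, total_requests=1_000_000, failed_requests=400)
hour_burn = burn_rate(SLO, window_requests=10_000, window_failures=150)
five_min_burn = burn_rate(SLO, window_requests=1_000, window_failures=20)

print(f"budget remaining: {remaining:.0%}")
print(f"page on-call: {should_page(hour_burn, five_min_burn)}")
```

Alerting on burn rate rather than on raw error counts ties each page to the SLO, which helps keep alerts actionable.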

Module 7: Capacity and Performance Management

Forecasting resource demand based on historical growth trends and business initiatives
Conducting load testing under realistic traffic patterns to identify performance bottlenecks
Implementing auto-scaling policies with appropriate cooldown periods and metric triggers
Monitoring resource saturation (CPU, memory, I/O) to prevent performance degradation
Right-sizing cloud instances based on actual utilization and cost-performance trade-offs
Planning for seasonal spikes (e.g., end-of-month, holiday periods) with preemptive scaling
Using capacity models to evaluate the impact of new features on infrastructure requirements
Establishing early warning indicators for capacity exhaustion (e.g., disk space, connection pools)
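
An early warning indicator for capacity exhaustion can be derived from a simple trend fit. The sketch below fits a least-squares line to evenly spaced usage samples and estimates the time remaining until capacity is reached. It assumes roughly linear growth, which will not hold for seasonal or bursty workloads. The function name and the sample data are hypothetical.

```python
def days_until_exhaustion(daily_usage, capacity):
    """Estimate days until usage reaches capacity, from a least-squares trend.

    daily_usage: one sample per day, oldest first (at least two samples)
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance  # growth per day
    if slope <= 0:
        return None
    return (capacity - daily_usage[-1]) / slope


# Illustrative data: disk usage in GB over five days on a 1,000 GB volume.
usage_gb = [500, 510, 520, 530, 540]
print(days_until_exhaustion(usage_gb, capacity=1000))
```

An alert on this estimate, for example when it drops below the lead time needed to provision more capacity, gives earlier warning than a fixed utilization threshold.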

Module 8: Governance and Compliance in Availability Management

Establishing an availability review board to approve architecture changes to critical systems
Conducting quarterly availability risk assessments with input from security, operations, and business units
Aligning availability controls with regulatory requirements (e.g., SOX, HIPAA, GDPR)
Documenting and auditing access controls for production environments and change management systems
Requiring third-party vendors to provide availability reports and undergo security assessments
Integrating availability metrics into executive reporting dashboards with trend analysis
Enforcing configuration management database (CMDB) accuracy for all availability-critical components
Conducting tabletop exercises with legal and PR teams to prepare for major outage communications

Module 9: Continuous Improvement and Maturity Assessment

Implementing a service availability maturity model to assess and track team capabilities
Conducting annual availability architecture reviews for critical services
Benchmarking availability performance against industry standards and peer organizations
Using error budget consumption rates to identify teams needing operational improvement
Integrating availability KPIs into team performance evaluations and planning cycles
Establishing a center of excellence to share best practices and tooling across teams
Rotating engineers through on-call and incident response roles to build operational empathy
Investing in automation to reduce toil and minimize human error in availability-critical processes