This curriculum covers the scope of a multi-workshop operational readiness program. It addresses the availability challenges that come up in enterprise advisory engagements, from SLI negotiation and regional failover design to incident command and cost-optimized scalability, and it goes to the depth of the internal capability builds run by mature platform engineering teams.
Module 1: Defining Availability Requirements and SLIs
- Selecting appropriate service-level indicators (SLIs) based on user-facing transactions rather than infrastructure metrics
- Negotiating SLOs with product and business stakeholders when uptime requirements conflict with release velocity
- Implementing error budget policies that balance innovation pace with system stability
- Differentiating between perceived and measured availability in customer-facing applications
- Designing SLIs for asynchronous workflows where request-response patterns don't apply
- Mapping dependencies across microservices to determine composite SLI calculations
- Handling SLI drift due to changes in traffic patterns or user behavior over time
- Integrating business KPIs with technical availability metrics for executive reporting
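The SLI, error budget, and composite-availability ideas in this module can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not a prescribed implementation: the function names are hypothetical, the SLI is assumed to be a simple good-events ratio, and the composite calculation assumes a serial call chain whose dependencies fail independently.

```python
from math import prod


def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of user-facing events that succeeded."""
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good_events / total_events


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / allowed_failure


def serial_composite_availability(dependency_slis) -> float:
    """Availability of a request that needs every dependency to succeed.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(dependency_slis)
```

For example, a service that measures 99.95% against a 99.9% SLO has spent about half of its budget, and three 99.9% dependencies in series yield roughly 99.7% for the composite request.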
Module 2: Architecture for High Availability
- Choosing active-active vs active-passive deployment models based on data consistency requirements
- Implementing regional failover mechanisms with DNS and global load balancers
- Designing stateless services to enable seamless horizontal scaling and failover
- Managing distributed session state across availability zones without single points of failure
- Configuring quorum-based consensus in distributed databases during network partitions
- Validating cross-region data replication lag under peak load conditions
- Selecting appropriate retry strategies and backoff algorithms for inter-service communication
- Architecting for graceful degradation when dependent services are unavailable
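One common choice for the retry and backoff topic above is capped exponential backoff with full jitter. The sketch below is illustrative only: `call_with_retries` and its parameters are hypothetical names, the retryable exception types are an assumption, and the `sleep` argument exists so the delay can be replaced in tests.

```python
import random
import time


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call operation(), retrying transient failures with jittered backoff.

    The delay ceiling doubles on each failed attempt (base, 2*base, 4*base, ...)
    up to cap, and the actual delay is drawn uniformly from [0, ceiling] so that
    many clients retrying at once do not synchronize.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Retries like this are only safe for idempotent operations, and the total attempt count should stay small so that retries do not amplify load on a dependency that is already struggling.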
Module 3: Monitoring and Observability
- Instrumenting services with structured logging to enable automated anomaly detection
- Setting dynamic alert thresholds based on historical traffic patterns and seasonality
- Reducing alert fatigue by implementing alert grouping, routing, and escalation policies
- Correlating metrics, logs, and traces to identify root causes during outages
- Deploying synthetic transactions to monitor availability from external vantage points
- Validating monitoring coverage across all critical user journeys and backend dependencies
- Managing cardinality explosion in metrics and logs when scaling to millions of users
- Integrating observability data with incident management systems to reduce mean time to resolve (MTTR)
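Dynamic alert thresholds that respect seasonality can be sketched by comparing the current value only against history from the same time slot. The example below is a simplified illustration under stated assumptions: history is bucketed by hour of the week, the threshold is the mean plus `k` sample standard deviations, and all names are hypothetical.

```python
from statistics import mean, stdev


def dynamic_threshold(samples, k=3.0):
    """Upper alert threshold: mean plus k sample standard deviations."""
    if len(samples) < 2:
        raise ValueError("need at least two historical samples")
    return mean(samples) + k * stdev(samples)


def should_alert(current, history_by_hour, hour_of_week, k=3.0):
    """Compare the current value with history from the same hour of the week.

    history_by_hour maps an hour-of-week index (0..167) to past observations,
    so a Monday 09:00 peak is judged against earlier Monday 09:00 peaks rather
    than against quiet overnight traffic.
    """
    return current > dynamic_threshold(history_by_hour[hour_of_week], k)
```

A mean-and-deviation threshold is sensitive to outliers in the history; a percentile-based threshold is a reasonable alternative when past incidents have contaminated the baseline.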
Module 4: Incident Management and Response
- Establishing on-call rotations with clear escalation paths and fatigue mitigation
- Conducting blameless postmortems that focus on systemic issues rather than individuals
- Implementing incident command structures (e.g., incident commander, communications lead, operations lead) during major outages
- Automating incident documentation and timeline reconstruction from monitoring data
- Managing external communications during customer-impacting incidents
- Testing incident response procedures through regular game days and fire drills
- Integrating runbooks into incident response platforms for real-time access
- Measuring and improving mean time to detect (MTTD) and mean time to resolve (MTTR)
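MTTD and MTTR can be computed directly from incident timelines once start, detection, and resolution timestamps are recorded. The sketch below assumes a hypothetical `Incident` record and measures both intervals from the start of customer impact; some teams measure MTTR from detection instead, so the definition should be agreed before comparing numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when customer impact began
    detected: datetime  # when monitoring or a human noticed
    resolved: datetime  # when impact ended


def _mean(durations):
    durations = list(durations)
    if not durations:
        raise ValueError("no incidents to average")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents):
    """Average gap between impact start and detection."""
    return _mean(i.detected - i.started for i in incidents)


def mean_time_to_resolve(incidents):
    """Average gap between impact start and resolution (assumed definition)."""
    return _mean(i.resolved - i.started for i in incidents)
```

Means hide long-tail incidents, so it is often worth reporting a median or a high percentile alongside them.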
Module 5: Change and Release Management
- Implementing canary releases with automated rollback based on SLO violations
- Enforcing change advisory board (CAB) reviews for high-risk deployments
- Using feature flags to decouple deployment from release and enable quick disablement
- Validating backward compatibility in APIs before rolling out new versions
- Coordinating deployment windows across interdependent teams with shared services
- Tracking deployment health using build metadata and deployment identifiers in logs
- Managing configuration drift between environments that impacts availability
- Enforcing deployment freeze periods during critical business events
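The canary bullet in this module implies a decision rule that a deployment pipeline evaluates repeatedly while the canary serves traffic. The following is a minimal sketch of such a rule under assumptions: the SLO is expressed as a maximum error rate, a minimum request count guards against deciding on too little data, and the function name and default values are hypothetical.

```python
def canary_decision(canary_errors, canary_total,
                    slo_error_rate=0.001, min_requests=500):
    """Return "wait", "rollback", or "promote" for a canary release.

    - "wait":     not enough traffic yet to judge the canary
    - "rollback": the canary's error rate violates the SLO
    - "promote":  the canary is within the SLO
    """
    if canary_total < min_requests:
        return "wait"
    error_rate = canary_errors / canary_total
    if error_rate > slo_error_rate:
        return "rollback"
    return "promote"
```

A production rule would usually also compare the canary against the current baseline over the same window, so that a pre-existing problem is not blamed on the new version.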
Module 6: Disaster Recovery and Business Continuity
- Classifying systems by recovery time objective (RTO) and recovery point objective (RPO)
- Testing full data center failover without disrupting production traffic
- Validating backup integrity and restoration procedures on a regular schedule
- Documenting and updating disaster recovery playbooks as architectures evolve
- Storing encryption keys and credentials in geographically dispersed secure vaults
- Conducting tabletop exercises with legal, PR, and executive teams during DR planning
- Ensuring third-party dependencies have equivalent DR capabilities
- Managing data sovereignty requirements during cross-border failover
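Classifying systems by RTO and RPO usually ends in a small set of recovery tiers. The sketch below is purely illustrative: the tier names and time limits in `TIERS` are hypothetical placeholders rather than recommended values, and a system is assigned to the strictest tier that either of its two objectives demands.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: (name, strictest RTO covered, strictest RPO covered).
TIERS = [
    ("tier-1", timedelta(minutes=15), timedelta(minutes=5)),
    ("tier-2", timedelta(hours=4), timedelta(hours=1)),
]


def classify(rto: timedelta, rpo: timedelta) -> str:
    """Pick the strictest tier required by either the RTO or the RPO."""
    for name, max_rto, max_rpo in TIERS:
        if rto <= max_rto or rpo <= max_rpo:
            return name
    return "tier-3"  # everything with looser objectives


def rpo_violated(last_backup_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the newest restorable copy is older than the RPO allows."""
    return now - last_backup_at > rpo
```

The `rpo_violated` check is the kind of condition a scheduled backup-validation job could alert on between full restoration tests.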
Module 7: Capacity Planning and Scalability
- Forecasting resource needs based on historical growth and product roadmap
- Implementing auto-scaling policies that respond to both load and error rates
- Conducting load testing to validate system behavior under peak and surge conditions
- Identifying and eliminating scalability bottlenecks in stateful components
- Right-sizing cloud instances based on actual utilization and cost-performance trade-offs
- Managing cold start issues in serverless environments during traffic spikes
- Planning for sudden traffic surges due to marketing campaigns or viral events
- Using chaos engineering to test scaling limits and failure modes
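An auto-scaling policy that responds to both load and error rates can be sketched as a target-tracking calculation with an error-rate override. This is one possible interpretation, not a reference policy: the proportional formula mirrors the style used by common horizontal autoscalers, while the rule "never scale in and add one replica while errors are elevated" and all default values are assumptions made for illustration.

```python
import math


def desired_replicas(current, cpu_percent, error_rate,
                     target_cpu_percent=60, error_threshold=0.02,
                     min_replicas=2, max_replicas=50):
    """Replica count suggested by load, adjusted for elevated error rates.

    current:      replicas running now
    cpu_percent:  average CPU utilization across replicas, as an integer 0..100+
    error_rate:   fraction of requests failing, 0.0..1.0
    """
    # Target tracking: scale proportionally so utilization returns to the target.
    by_load = math.ceil(current * cpu_percent / target_cpu_percent)
    desired = by_load
    if error_rate > error_threshold:
        # Assumed policy: do not scale in during an error spike, and add one replica.
        desired = max(by_load, current + 1)
    return max(min_replicas, min(max_replicas, desired))
```

Scaling out on errors only helps when the errors stem from overload; if a bad deployment or a failing dependency is the cause, extra replicas add cost without restoring availability, which is why the override here is deliberately small.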
Module 8: Governance, Compliance, and Risk
- Aligning availability controls with regulatory requirements (e.g., HIPAA, PCI-DSS)
- Documenting risk acceptance decisions for known single points of failure
- Conducting third-party audits of cloud provider SLAs and operational practices
- Managing access controls for production systems to prevent unauthorized changes
- Implementing change logging and audit trails for compliance reporting
- Assessing vendor lock-in risks when using proprietary high-availability services
- Establishing data retention policies for logs and monitoring data
- Performing annual risk assessments that include availability threats
Module 9: Cost Optimization and Efficiency
- Evaluating the cost of over-provisioning against the risk of downtime
- Using reserved instances and savings plans without compromising flexibility
- Right-sizing databases and storage based on access patterns and retention needs
- Implementing automated shutdown of non-production environments during off-hours
- Measuring cost per transaction to identify inefficient services
- Negotiating SLAs with vendors based on actual usage and performance data
- Conducting cost reviews during postmortems to assess economic impact of outages
- Optimizing CDN and data transfer costs in globally distributed architectures
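Two of the calculations in this module, cost per transaction and the trade-off between over-provisioning and downtime risk, reduce to simple arithmetic. The sketch below uses hypothetical function names and a deliberately crude expected-value model (probability of an outage times its duration times revenue lost per hour); real estimates would also account for reputational and contractual costs.

```python
def cost_per_transaction(monthly_cost: float, monthly_transactions: int) -> float:
    """Infrastructure cost attributed to each transaction."""
    if monthly_transactions <= 0:
        raise ValueError("monthly_transactions must be positive")
    return monthly_cost / monthly_transactions


def expected_downtime_cost(outage_probability: float, outage_hours: float,
                           revenue_per_hour: float) -> float:
    """Expected monthly loss from outages under a simple expected-value model."""
    return outage_probability * outage_hours * revenue_per_hour


def extra_capacity_pays_off(extra_monthly_cost: float,
                            risk_cost_before: float,
                            risk_cost_after: float) -> bool:
    """True when the reduction in expected downtime cost exceeds the added spend."""
    return (risk_cost_before - risk_cost_after) > extra_monthly_cost
```

For instance, if added headroom costs 3,000 per month and lowers the expected outage loss from 10,000 to 4,000, the reduction of 6,000 exceeds the spend and the extra capacity is justified under this model.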