Description

This curriculum covers the scope of a multi-workshop operational readiness program. It addresses the availability challenges typically handled in enterprise advisory engagements, from SLI negotiation and regional failover design to incident command and cost-optimized scalability, at the depth of the internal capability builds run by mature platform engineering teams.

Module 1: Defining Availability Requirements and SLIs

Selecting appropriate service-level indicators (SLIs) based on user-facing transactions rather than infrastructure metrics
Negotiating SLOs with product and business stakeholders when uptime requirements conflict with release velocity
Implementing error budget policies that balance innovation pace with system stability
Differentiating between perceived and measured availability in customer-facing applications
Designing SLIs for asynchronous workflows where request-response patterns don't apply
Mapping dependencies across microservices to determine composite SLI calculations
Handling SLI drift due to changes in traffic patterns or user behavior over time
Integrating business KPIs with technical availability metrics for executive reporting
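
As an illustration of the error budget and composite SLI topics in this module, the following is a minimal sketch rather than a prescribed implementation. It assumes an event-based availability SLI (good events over total events) and a strictly serial dependency chain with independent failures; all function names are hypothetical.

```python
from math import prod
from typing import Iterable


def availability_sli(good_events: int, total_events: int) -> float:
    """Event-based SLI: fraction of user-facing requests that succeeded."""
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return good_events / total_events


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / budget


def composite_serial_sli(dependency_slis: Iterable[float]) -> float:
    """Availability of a request that needs every dependency to succeed.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(dependency_slis)
```

With a 99.9% SLO and a measured SLI of 99.95%, about half of the budget remains. Two serial dependencies at 99.9% each give roughly 99.8% composite availability, which is why composite targets need to be negotiated per user journey rather than per service.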

Module 2: Architecture for High Availability

Choosing active-active vs active-passive deployment models based on data consistency requirements
Implementing regional failover mechanisms with DNS and global load balancers
Designing stateless services to enable seamless horizontal scaling and failover
Managing distributed session state across availability zones without single points of failure
Configuring quorum-based consensus in distributed databases to preserve consistency during network partitions
Validating cross-region data replication lag under peak load conditions
Selecting appropriate retry strategies and backoff algorithms for inter-service communication
Architecting for graceful degradation when dependent services are unavailable
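
The retry and backoff topic above can be sketched as follows. This is one common approach (capped exponential backoff with full jitter), not the only valid strategy. The helper names and default values are hypothetical, and catching every exception is a simplification: in practice, retry only idempotent operations and only on errors known to be transient.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.1,
    cap: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Retry `operation` on any exception, sleeping a jittered delay between tries."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
    raise ValueError("max_attempts must be at least 1")
```

The jitter spreads retries from many clients over time, so a recovering dependency is not hit by synchronized retry waves. The `sleep` and `rng` parameters are injectable so the behavior can be tested without real delays.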

Module 3: Monitoring and Observability

Instrumenting services with structured logging to enable automated anomaly detection
Setting dynamic alert thresholds based on historical traffic patterns and seasonality
Reducing alert fatigue by implementing alert grouping, routing, and escalation policies
Correlating metrics, logs, and traces to identify root causes during outages
Deploying synthetic transactions to monitor availability from external vantage points
Validating monitoring coverage across all critical user journeys and backend dependencies
Managing cardinality explosion in metrics and logs when scaling to millions of users
Integrating observability data with incident management systems to reduce MTTR
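
For the dynamic alert threshold topic above, here is a minimal sketch under stated assumptions: the threshold is the historical mean plus `k` sample standard deviations, and seasonality is approximated by bucketing samples by hour of day. Both choices and all names are illustrative; production systems often use more robust statistics or longer seasonal windows.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, List, Tuple


def dynamic_threshold(history: List[float], k: float = 3.0) -> float:
    """Alert threshold: mean of past values plus k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return mean(history) + k * stdev(history)


def hourly_thresholds(
    samples: Iterable[Tuple[int, float]], k: float = 3.0
) -> Dict[int, float]:
    """Compute one threshold per hour of day from (hour, value) samples.

    Hours with fewer than two samples are skipped rather than guessed.
    """
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {
        hour: dynamic_threshold(values, k)
        for hour, values in by_hour.items()
        if len(values) >= 2
    }
```

Keeping a separate threshold per hour lets a quiet overnight period alert on a level that would be normal at midday peak.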

Module 4: Incident Management and Response

Establishing on-call rotations with clear escalation paths and fatigue mitigation
Conducting blameless postmortems that focus on systemic issues rather than individuals
Implementing incident command structures (e.g., IC, comms lead, ops lead) during major outages
Automating incident documentation and timeline reconstruction from monitoring data
Managing external communications during customer-impacting incidents
Testing incident response procedures through regular game days and fire drills
Integrating runbooks into incident response platforms for real-time access
Measuring and improving mean time to detect (MTTD) and mean time to resolve (MTTR)
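
The MTTD and MTTR measurements named above can be sketched as follows. This is a minimal example with a hypothetical `Incident` record. It assumes both metrics are measured from the start of user impact; some teams measure time to resolve from detection instead, so the definition should be agreed on before comparing numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Incident:
    started: datetime   # when user impact began
    detected: datetime  # when an alert fired or a human noticed
    resolved: datetime  # when user impact ended


def _mean(durations: List[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("no incidents to average")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average gap between the start of impact and detection."""
    return _mean([i.detected - i.started for i in incidents])


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average gap between the start of impact and resolution."""
    return _mean([i.resolved - i.started for i in incidents])
```

Because a mean is sensitive to a single long outage, it can be worth reporting the median or a percentile alongside it.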

Module 5: Change and Release Management

Implementing canary releases with automated rollback based on SLO violations
Enforcing change advisory board (CAB) reviews for high-risk deployments
Using feature flags to decouple deployment from release and enable quick disablement
Validating backward compatibility in APIs before rolling out new versions
Coordinating deployment windows across interdependent teams with shared services
Tracking deployment health using build metadata and deployment identifiers in logs
Managing configuration drift between environments that impacts availability
Enforcing deployment freeze periods during critical business events
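
As an illustration of canary releases with automated rollback on SLO violations, the following sketch shows only the decision step. The `CanaryWindow` type, the default SLO target, and the minimum sample size are hypothetical; a real pipeline would also compare the canary against the baseline fleet and consider latency SLIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryWindow:
    total_requests: int
    failed_requests: int


def canary_decision(
    window: CanaryWindow,
    slo_target: float = 0.999,
    min_requests: int = 500,
) -> str:
    """Return "wait", "promote", or "rollback" for one observation window."""
    if window.total_requests < min_requests:
        return "wait"  # too little traffic to judge the canary either way
    good = window.total_requests - window.failed_requests
    success_ratio = good / window.total_requests
    return "promote" if success_ratio >= slo_target else "rollback"
```

The minimum request count guards against rolling back, or promoting, on a handful of requests where one failure dominates the ratio.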

Module 6: Disaster Recovery and Business Continuity

Classifying systems by recovery time objective (RTO) and recovery point objective (RPO)
Testing full data center failover without disrupting production traffic
Validating backup integrity and restoration procedures on a regular schedule
Documenting and updating disaster recovery playbooks as architectures evolve
Storing encryption keys and credentials in geographically dispersed secure vaults
Conducting tabletop exercises with legal, PR, and executive teams during DR planning
Ensuring third-party dependencies have equivalent DR capabilities
Managing data sovereignty requirements during cross-border failover
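
The RTO and RPO classification above becomes actionable once each system's objectives are checked against its backup schedule and the restore time measured in a drill. Below is a minimal sketch with hypothetical names. It assumes periodic backups, so the worst-case data loss is approximately one backup interval; continuous replication would need a lag-based check instead.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


def objective_violations(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_restore_time: timedelta,
) -> List[str]:
    """List which objectives the current backup setup fails to meet."""
    violations: List[str] = []
    # Worst case, the failure happens just before the next backup runs,
    # so up to one full interval of data is lost.
    if backup_interval > objectives.rpo:
        violations.append("RPO")
    if measured_restore_time > objectives.rto:
        violations.append("RTO")
    return violations
```

Feeding this check with restore times from actual restoration tests, rather than estimates, ties it back to the backup validation item in this module.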

Module 7: Capacity Planning and Scalability

Forecasting resource needs based on historical growth and product roadmap
Implementing auto-scaling policies that respond to both load and error rates
Conducting load testing to validate system behavior under peak and surge conditions
Identifying and eliminating scalability bottlenecks in stateful components
Right-sizing cloud instances based on actual utilization and cost-performance trade-offs
Managing cold start issues in serverless environments during traffic spikes
Planning for sudden traffic surges due to marketing campaigns or viral events
Using chaos engineering to test scaling limits and failure modes
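
An auto-scaling policy that responds to both load and error rates, as listed above, might be sketched like this. It assumes a proportional rule of the kind used by target-tracking autoscalers, plus one added guard: scale-in is blocked while the error rate is elevated. The function name, thresholds, and replica bounds are hypothetical.

```python
from math import ceil


def desired_replicas(
    current: int,
    utilization: float,
    target_utilization: float,
    error_rate: float,
    max_error_rate: float = 0.01,
    min_replicas: int = 2,
    max_replicas: int = 50,
) -> int:
    """Propose a replica count from load, holding capacity while errors are high."""
    if current < 1 or target_utilization <= 0:
        raise ValueError("current must be >= 1 and target_utilization > 0")
    # Proportional rule: scale so that per-replica utilization nears the target.
    proposed = ceil(current * utilization / target_utilization)
    if error_rate > max_error_rate:
        # Low load may be a symptom of failing requests, so never scale in here.
        proposed = max(proposed, current)
    return max(min_replicas, min(max_replicas, proposed))
```

For example, 4 replicas at 75% utilization with a 50% target scale out to 6, while 10 replicas at 25% utilization scale in to 5 only if the error rate is at or below the limit.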

Module 8: Governance, Compliance, and Risk

Aligning availability controls with regulatory requirements (e.g., HIPAA, PCI-DSS)
Documenting risk acceptance decisions for known single points of failure
Conducting third-party audits of cloud provider SLAs and operational practices
Managing access controls for production systems to prevent unauthorized changes
Implementing change logging and audit trails for compliance reporting
Assessing vendor lock-in risks when using proprietary high-availability services
Establishing data retention policies for logs and monitoring data
Performing annual risk assessments that include availability threats