This curriculum forms a multi-workshop program covering the technical, operational, and governance dimensions of availability targets: how they are defined, implemented, and maintained across complex service environments.
Module 1: Defining and Classifying Service Availability
- Selecting appropriate availability classifications (e.g., mission-critical, business-essential, non-essential) based on business impact analysis and stakeholder input.
- Mapping application dependencies to determine cascading failure risks and their influence on availability classifications.
- Establishing criteria for defining "available" states, including response time thresholds, transaction success rates, and user access validation.
- Documenting uptime expectations for non-production environments (e.g., staging, UAT) to align with release management cycles.
- Aligning availability definitions with existing ITIL or SRE frameworks without creating conflicting terminology.
- Handling discrepancies between end-user perceived availability and system-reported uptime through synthetic monitoring integration.
- Defining failover eligibility for services based on recovery time objectives (RTO) and data loss tolerance, expressed as recovery point objectives (RPO).
- Creating service boundary diagrams to clarify scope and prevent overcommitment in availability promises.
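The criteria for an "available" state listed in this module can be made concrete as a small check. The following is a minimal sketch, not a prescribed definition: the names `AvailabilityCriteria`, `ProbeWindow`, and `is_available`, and the threshold values used in the example, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityCriteria:
    """Hypothetical thresholds agreed with stakeholders."""
    max_p95_latency_ms: float   # response time threshold
    min_success_rate: float     # minimum fraction of successful transactions


@dataclass(frozen=True)
class ProbeWindow:
    """Hypothetical measurements collected over one evaluation window."""
    p95_latency_ms: float
    successes: int
    attempts: int
    login_check_passed: bool    # result of a user access validation probe


def is_available(window: ProbeWindow, criteria: AvailabilityCriteria) -> bool:
    """Return True only if every criterion is met in this window."""
    if window.attempts == 0:
        # No traffic and no probes: treat as unavailable rather than guess.
        return False
    success_rate = window.successes / window.attempts
    return (
        window.p95_latency_ms <= criteria.max_p95_latency_ms
        and success_rate >= criteria.min_success_rate
        and window.login_check_passed
    )


criteria = AvailabilityCriteria(max_p95_latency_ms=800.0, min_success_rate=0.99)
healthy = ProbeWindow(p95_latency_ms=420.0, successes=995, attempts=1000,
                      login_check_passed=True)
slow = ProbeWindow(p95_latency_ms=1500.0, successes=1000, attempts=1000,
                   login_check_passed=True)
print(is_available(healthy, criteria), is_available(slow, criteria))
```

Treating a window with no measurements as unavailable is one possible convention; a program could equally decide to exclude such windows from the calculation.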
Module 2: Establishing Realistic Availability Targets
- Calculating achievable uptime percentages based on historical incident data and infrastructure reliability metrics.
- Balancing stakeholder demands for "five nines" (99.999%) against cost, complexity, and technical feasibility.
- Differentiating between infrastructure availability and end-to-end service availability when setting targets.
- Adjusting availability targets for services with scheduled maintenance windows or batch processing cycles.
- Factoring in third-party dependencies (e.g., cloud providers, APIs) when committing to internal or external SLAs.
- Setting tiered availability targets for different customer segments or contract levels.
- Using Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF) to validate target realism.
- Documenting assumptions and exclusions (e.g., force majeure, DDoS attacks) that affect target applicability.
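As a rough illustration of how MTBF and MTTR can be used to sanity-check a target, the sketch below applies the standard steady-state relation, availability = MTBF / (MTBF + MTTR), and converts a target percentage into a downtime budget. The function names and the 30-day reporting period are assumptions made for this example.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time in service: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * period_days * 24 * 60


# One failure every 30 days on average, one hour to recover each time.
achievable = steady_state_availability(mtbf_hours=720, mttr_hours=1)
print(f"achievable availability: {achievable:.5f}")

# Compare the budgets implied by two candidate targets.
for target in (0.999, 0.99999):
    print(f"{target:.3%} allows {allowed_downtime_minutes(target):.2f} min per 30 days")
```

With these example inputs the historical record supports roughly 99.86%, which falls short of a 99.9% target and far short of "five nines", whose 30-day budget is under half a minute.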
Module 3: Architecting for High Availability
- Selecting active-active vs. active-passive configurations based on data consistency requirements and failover tolerance.
- Designing stateless services to enable horizontal scaling and reduce single points of failure.
- Implementing health checks and readiness probes that accurately reflect service operability.
- Integrating circuit breakers and retry mechanisms to prevent cascading failures during partial outages.
- Choosing replication strategies (synchronous vs. asynchronous) based on RPO and latency constraints.
- Deploying multi-region or multi-zone architectures while managing data sovereignty and latency trade-offs.
- Validating failover procedures through automated chaos engineering tests in pre-production environments.
- Ensuring DNS and load balancer configurations support rapid traffic rerouting during incidents.
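The circuit breaker pattern mentioned in this module can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production implementation: the class `CircuitBreaker`, its parameters, and the single trial call after the timeout are choices made for this example, and real libraries typically add thread safety and richer half-open handling.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock                      # injectable for testing
        self._failures = 0                       # consecutive failures seen
        self._opened_at: Optional[float] = None  # None means circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_inventory())`, where `fetch_inventory` stands in for any hypothetical dependency call. Retries belong outside the breaker, so that repeated attempts also count toward the failure threshold.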
Module 4: Monitoring and Measuring Availability
- Configuring synthetic transactions to simulate critical user journeys and detect functional unavailability.
- Correlating infrastructure metrics (CPU, memory) with application-level health indicators to reduce false positives.
- Setting up alerting thresholds that distinguish between transient issues and sustained outages.
- Calculating rolling availability percentages with time-weighted methods, so that each interval counts in proportion to its duration and short or irregular sampling periods do not skew the result.
- Integrating third-party monitoring data into internal dashboards for holistic availability views.
- Handling clock drift and timezone inconsistencies in distributed system logs during incident analysis.
- Excluding planned maintenance periods from availability calculations using automated scheduling hooks.
- Validating monitoring coverage across all service components, including background workers and queues.
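One way to combine the time-weighted calculation with the exclusion of planned maintenance is sketched below. It assumes a particular convention, namely that maintenance time is removed from both the downtime and the measured period; an SLA may define this differently. Intervals are `(start, end)` pairs in seconds, and all function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds, end exclusive


def _clip_and_merge(intervals: Iterable[Interval], lo: float, hi: float) -> List[Interval]:
    """Clip intervals to [lo, hi) and merge any that overlap or touch."""
    clipped = sorted((max(s, lo), min(e, hi)) for s, e in intervals
                     if min(e, hi) > max(s, lo))
    merged: List[Interval] = []
    for start, end in clipped:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def _subtract(intervals: List[Interval], holes: List[Interval]) -> List[Interval]:
    """Remove the holes from the intervals; both inputs sorted and merged."""
    result: List[Interval] = []
    for start, end in intervals:
        cursor = start
        for hole_start, hole_end in holes:
            if hole_end <= cursor or hole_start >= end:
                continue  # hole does not touch the remaining part
            if hole_start > cursor:
                result.append((cursor, hole_start))
            cursor = max(cursor, hole_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def availability(window: Interval, outages: Iterable[Interval],
                 maintenance: Iterable[Interval] = ()) -> float:
    """Time-weighted availability over the window, excluding maintenance."""
    lo, hi = window
    maint = _clip_and_merge(maintenance, lo, hi)
    down = _subtract(_clip_and_merge(outages, lo, hi), maint)
    measured = (hi - lo) - sum(e - s for s, e in maint)
    if measured <= 0:
        raise ValueError("window is empty or fully covered by maintenance")
    return 1.0 - sum(e - s for s, e in down) / measured


# A 1000 s window with two outages; the second overlaps planned maintenance.
print(availability((0, 1000), outages=[(100, 150), (400, 500)],
                   maintenance=[(450, 600)]))
```

Merging overlapping intervals first avoids counting the same second of downtime twice when several monitors report the same outage.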
Module 5: Incident Management and Availability Impact
- Classifying incidents by availability impact (e.g., partial degradation, complete outage) for accurate SLA tracking.
- Integrating incident timelines with availability reporting to support root cause and duration analysis.
- Defining escalation paths that activate based on duration and severity of availability breaches.
- Coordinating communication between SRE, NOC, and business units during ongoing outages affecting SLAs.
- Using post-incident reviews to identify recurring availability risks and update architectural controls.
- Managing customer notifications so that outages are not declared before they have been confirmed.
- Logging incident response actions to support audit requirements and regulatory reporting.
- Assessing whether workarounds restore functional availability or merely mask underlying failures.
Module 6: SLA Design and Contractual Integration
- Negotiating SLA exclusions for planned maintenance, customer-caused outages, and force majeure events.
- Defining precise measurement methodologies in SLAs to prevent disputes over reported uptime.
- Aligning internal SLOs with external SLAs to ensure operational feasibility and accountability.
- Structuring penalty clauses that reflect actual business impact without creating financial disincentives for transparency.
- Specifying data sources and audit rights for third-party verification of SLA compliance.
- Handling SLA aggregation across multiple services or components with interdependent availability.
- Updating SLAs when service scope or architecture changes (e.g., migration to cloud, new dependencies).
- Documenting SLA exceptions for beta, preview, or experimental features offered without guarantees.
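For SLA aggregation across interdependent components, a common first approximation multiplies the availabilities of components that must all work (a serial chain) and combines redundant replicas as one minus the probability that all fail together. The sketch below assumes failures are statistically independent, which real systems often violate, so the results are best read as optimistic estimates. The function names are hypothetical.

```python
import math
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """All components are required: multiply their availabilities."""
    return math.prod(components)


def redundant_availability(replicas: Iterable[float]) -> float:
    """Any one replica suffices: 1 minus the chance that all are down."""
    return 1.0 - math.prod(1.0 - a for a in replicas)


# A request path through a load balancer, an application tier, and a database.
chain = serial_availability([0.9999, 0.999, 0.9995])
print(f"end-to-end: {chain:.5f}")

# Two independent application replicas, each 99% available.
pair = redundant_availability([0.99, 0.99])
print(f"redundant pair: {pair:.4f}")
```

The serial result is always lower than the weakest component, which is why an external SLA cannot simply restate the availability of the best-performing layer.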
Module 7: Capacity Planning and Scalability for Availability
- Forecasting traffic growth and provisioning capacity to prevent resource exhaustion outages.
- Implementing auto-scaling policies that respond to real-time load while avoiding thrashing.
- Conducting load testing to validate system behavior at peak and sustained capacity levels.
- Reserving failover capacity in secondary regions without incurring unnecessary idle costs.
- Managing database connection pool limits to prevent exhaustion during traffic spikes.
- Planning for seasonal or event-driven load variations (e.g., fiscal closing, marketing campaigns).
- Using capacity trend analysis to justify infrastructure investments that improve availability.
- Coordinating capacity updates with change management to minimize deployment-related outages.
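Thrashing in auto-scaling is commonly reduced with two mechanisms: a gap between the scale-out and scale-in thresholds (hysteresis) and a cooldown after each change. The sketch below illustrates both; the `ScalingPolicy` fields, their default values, and the one-replica step size are assumptions for this example and do not mirror any specific cloud provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_above: float = 0.75   # utilization that triggers adding a replica
    scale_in_below: float = 0.40    # utilization that triggers removing one
    cooldown_s: float = 300.0       # minimum time between scaling actions
    min_replicas: int = 2
    max_replicas: int = 20


class Autoscaler:
    def __init__(self, policy: ScalingPolicy, replicas: int) -> None:
        self.policy = policy
        self.replicas = replicas
        self._last_change_at: Optional[float] = None

    def decide(self, utilization: float, now_s: float) -> int:
        """Return the replica count to run after observing utilization."""
        p = self.policy
        if (self._last_change_at is not None
                and now_s - self._last_change_at < p.cooldown_s):
            return self.replicas  # still cooling down: ignore the signal
        target = self.replicas
        if utilization > p.scale_out_above:
            target = min(self.replicas + 1, p.max_replicas)
        elif utilization < p.scale_in_below:
            target = max(self.replicas - 1, p.min_replicas)
        # Between the two thresholds nothing changes (the hysteresis band).
        if target != self.replicas:
            self.replicas = target
            self._last_change_at = now_s
        return self.replicas


scaler = Autoscaler(ScalingPolicy(), replicas=2)
for t, load in [(0, 0.90), (100, 0.90), (400, 0.50), (401, 0.20), (500, 0.20)]:
    print(t, load, scaler.decide(load, now_s=t))
```

In practice, scale-in is often given a longer cooldown than scale-out, since removing capacity too early is the riskier mistake for availability.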
Module 8: Governance, Reporting, and Continuous Improvement
- Producing monthly availability reports with breakdowns by service, region, and incident category.
- Presenting availability data to executive stakeholders using business-aligned KPIs rather than raw technical metrics.
- Conducting SLA compliance audits to verify accuracy of reported uptime and incident records.
- Updating availability targets based on business evolution, technology refresh, or risk appetite changes.
- Integrating availability performance into vendor scorecards for third-party service providers.
- Standardizing incident classification and reporting across teams to ensure data consistency.
- Using availability trends to prioritize reliability engineering initiatives in roadmap planning.
- Enforcing change advisory board (CAB) reviews for modifications that could impact availability targets.
Module 9: Regulatory and Compliance Considerations
- Mapping availability requirements to regulatory mandates (e.g., financial transaction systems, healthcare platforms).
- Designing audit trails that demonstrate continuous compliance with availability obligations.
- Handling jurisdiction-specific data residency rules when deploying redundant systems.
- Documenting business continuity and disaster recovery plans for regulatory inspections.
- Ensuring availability logging meets retention periods required by industry standards (e.g., PCI-DSS, HIPAA).
- Coordinating with legal teams to assess liability exposure from unmet availability commitments.
- Implementing role-based access controls for availability reporting to meet segregation of duties requirements.
- Validating that third-party providers comply with contractual and regulatory availability obligations.