Description

This multi-workshop curriculum covers the technical, operational, and governance dimensions of availability targets: how they are defined, implemented, and maintained across complex service environments.

Module 1: Defining and Classifying Service Availability

Selecting appropriate availability classifications (e.g., mission-critical, business-essential, non-essential) based on business impact analysis and stakeholder input.
Mapping application dependencies to determine cascading failure risks and their influence on availability classifications.
Establishing criteria for defining "available" states, including response time thresholds, transaction success rates, and user access validation.
Documenting uptime expectations for non-production environments (e.g., staging, UAT) to align with release management cycles.
Aligning availability definitions with existing ITIL or SRE frameworks without creating conflicting terminology.
Handling discrepancies between end-user perceived availability and system-reported uptime through synthetic monitoring integration.
Defining failover eligibility for services based on recovery time and data loss tolerance.
Creating service boundary diagrams to clarify scope and prevent overcommitment in availability promises.
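
As an illustration of the "available" criteria listed in this module, the following is a minimal sketch of how one measurement interval might be judged against a latency threshold, a transaction success rate, and a user access check. The class names, field names, and threshold values are hypothetical and are not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityCriteria:
    """Hypothetical thresholds agreed with stakeholders."""
    max_p95_latency_ms: float  # response time threshold
    min_success_rate: float    # fraction of successful transactions, 0..1


@dataclass(frozen=True)
class IntervalSample:
    """Measurements for one interval (for example, one minute)."""
    p95_latency_ms: float
    total_requests: int
    successful_requests: int
    login_probe_ok: bool       # result of a synthetic user access check


def is_available(sample: IntervalSample, criteria: AvailabilityCriteria) -> bool:
    """Return True only if every criterion holds for the interval."""
    if not sample.login_probe_ok:
        return False
    if sample.total_requests == 0:
        # Assumption: with no real traffic, the synthetic probe decides.
        return True
    success_rate = sample.successful_requests / sample.total_requests
    return (
        sample.p95_latency_ms <= criteria.max_p95_latency_ms
        and success_rate >= criteria.min_success_rate
    )


criteria = AvailabilityCriteria(max_p95_latency_ms=500, min_success_rate=0.995)
slow_interval = IntervalSample(900, 1000, 999, True)
print(is_available(slow_interval, criteria))  # slow but "up": counted as unavailable
```

Treating a slow interval as unavailable is one possible design choice; it narrows the gap between system-reported uptime and what end users perceive.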

Module 2: Establishing Realistic Availability Targets

Calculating achievable uptime percentages based on historical incident data and infrastructure reliability metrics.
Balancing stakeholder demands for "five nines" (99.999%) against cost, complexity, and technical feasibility.
Differentiating between infrastructure availability and end-to-end service availability when setting targets.
Adjusting availability targets for services with scheduled maintenance windows or batch processing cycles.
Factoring in third-party dependencies (e.g., cloud providers, APIs) when committing to internal or external SLAs.
Setting tiered availability targets for different customer segments or contract levels.
Using Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF) to check whether a proposed target is realistic.
Documenting assumptions and exclusions (e.g., force majeure, DDoS attacks) that affect target applicability.
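
The arithmetic behind these topics can be sketched briefly. The example below uses the common steady-state approximation availability = MTBF / (MTBF + MTTR) and converts a target percentage into a downtime budget. The function names and the sample figures (720 hours MTBF, 1 hour MTTR) are illustrative assumptions, not values from the curriculum.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability as the fraction of time spent between failures."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_minutes(target: float, period_days: int = 30) -> float:
    """Minutes of downtime a target (e.g. 0.999) allows over the period."""
    return (1 - target) * period_days * 24 * 60


def target_is_realistic(target: float, mtbf_hours: float, mttr_hours: float) -> bool:
    """Compare a proposed target with what historical MTBF/MTTR suggest."""
    return steady_state_availability(mtbf_hours, mttr_hours) >= target


# "Five nines" leaves roughly 5.26 minutes of downtime per 365-day year.
print(round(downtime_budget_minutes(0.99999, period_days=365), 2))

# One failure a month with a one-hour recovery does not support 99.9%.
print(target_is_realistic(0.999, mtbf_hours=720, mttr_hours=1))
```

A check like this only reflects past behaviour; maintenance windows, third-party dependencies, and documented exclusions still need to be accounted for separately.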

Module 3: Architecting for High Availability

Selecting active-active vs. active-passive configurations based on data consistency requirements and failover tolerance.
Designing stateless services to enable horizontal scaling and reduce single points of failure.
Implementing health checks and readiness probes that accurately reflect service operability.
Integrating circuit breakers and retry mechanisms to prevent cascading failures during partial outages.
Choosing replication strategies (synchronous vs. asynchronous) based on recovery point objective (RPO) and latency constraints.
Deploying multi-region or multi-zone architectures while managing data sovereignty and latency trade-offs.
Validating failover procedures through automated chaos engineering tests in pre-production environments.
Ensuring DNS and load balancer configurations support rapid traffic rerouting during incidents.
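
To make the circuit breaker topic concrete, here is a minimal, single-threaded sketch. It opens after a configurable number of consecutive failures, rejects calls while open, and allows one trial call after a reset timeout. The class, parameter names, and defaults are assumptions made for illustration; production libraries typically add thread safety, metrics, and richer half-open handling.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # time the circuit last opened, if any

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"       # timeout elapsed: permit a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency calls are suspended")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound dependency call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical function, and serve a fallback when `CircuitOpenError` is raised.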

Module 4: Monitoring and Measuring Availability

Configuring synthetic transactions to simulate critical user journeys and detect functional unavailability.
Correlating infrastructure metrics (CPU, memory) with application-level health indicators to reduce false positives.
Setting up alerting thresholds that distinguish between transient issues and sustained outages.
Calculating rolling availability percentages with time-weighted methods so that uneven sampling intervals do not skew the result.
Integrating third-party monitoring data into internal dashboards for holistic availability views.
Handling clock drift and timezone inconsistencies in distributed system logs during incident analysis.
Excluding planned maintenance periods from availability calculations using automated scheduling hooks.
Validating monitoring coverage across all service components, including background workers and queues.
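
One way to combine the time-weighted calculation with the maintenance exclusion is sketched below. It assumes outages and maintenance windows are supplied as (start, end) pairs of timezone-aware datetimes, removes maintenance time from both the measured period and the downtime, and divides what remains. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def _clip_and_merge(intervals: List[Interval], start: datetime, end: datetime) -> List[Interval]:
    """Clip intervals to the window, then merge any that overlap."""
    clipped = [(max(s, start), min(e, end)) for s, e in intervals]
    merged: List[Interval] = []
    for s, e in sorted(i for i in clipped if i[0] < i[1]):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def _subtract(intervals: List[Interval], exclusions: List[Interval]) -> List[Interval]:
    """Remove excluded time from each interval (inputs sorted and merged)."""
    result: List[Interval] = []
    for s, e in intervals:
        cursor = s
        for xs, xe in exclusions:
            if xe <= cursor or xs >= e:
                continue
            if xs > cursor:
                result.append((cursor, xs))
            cursor = max(cursor, xe)
        if cursor < e:
            result.append((cursor, e))
    return result


def _total(intervals: List[Interval]) -> timedelta:
    return sum((e - s for s, e in intervals), timedelta())


def availability(start: datetime, end: datetime,
                 outages: List[Interval], maintenance: List[Interval]) -> float:
    """Fraction of measured (non-maintenance) time without an outage."""
    maint = _clip_and_merge(maintenance, start, end)
    down = _clip_and_merge(outages, start, end)
    measured = (end - start) - _total(maint)
    if measured <= timedelta(0):
        raise ValueError("window contains no measured time")
    return 1 - _total(_subtract(down, maint)) / measured


window_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
window_end = window_start + timedelta(days=30)
maintenance = [(window_start + timedelta(days=7), window_start + timedelta(days=7, hours=2))]
outages = [(window_start + timedelta(days=7, hours=1, minutes=30),
            window_start + timedelta(days=7, hours=2, minutes=30))]
# Only the 30 minutes of outage outside the maintenance window count as downtime.
print(f"{availability(window_start, window_end, outages, maintenance):.5%}")
```

Using UTC timestamps throughout sidesteps the timezone inconsistencies mentioned above, although it does not correct clock drift between hosts.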

Module 5: Incident Management and Availability Impact

Classifying incidents by availability impact (e.g., partial degradation, complete outage) for accurate SLA tracking.
Integrating incident timelines with availability reporting to support root cause and duration analysis.
Defining escalation paths that activate based on duration and severity of availability breaches.
Coordinating communication between SRE, NOC, and business units during ongoing outages affecting SLAs.
Using post-incident reviews to identify recurring availability risks and update architectural controls.
Managing customer notifications without declaring outages before they are confirmed.
Logging incident response actions to support audit requirements and regulatory reporting.
Assessing whether workarounds restore functional availability or merely mask underlying failures.
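
An escalation path driven by impact class and elapsed duration could be expressed as a small lookup table, as in the sketch below. The impact classes mirror the examples in this module; the roles and minute thresholds are invented placeholders that each organization would replace with its own policy.

```python
from enum import Enum
from typing import List


class Impact(Enum):
    PARTIAL_DEGRADATION = "partial degradation"
    COMPLETE_OUTAGE = "complete outage"


# Hypothetical policy: (minutes since detection, role to notify).
ESCALATION_POLICY = {
    Impact.PARTIAL_DEGRADATION: [
        (0, "on-call engineer"),
        (30, "service owner"),
        (120, "incident commander"),
    ],
    Impact.COMPLETE_OUTAGE: [
        (0, "on-call engineer"),
        (5, "incident commander"),
        (30, "executive sponsor"),
    ],
}


def escalation_targets(impact: Impact, elapsed_minutes: float) -> List[str]:
    """Return every role that should have been engaged by this point."""
    return [
        role
        for threshold, role in ESCALATION_POLICY[impact]
        if elapsed_minutes >= threshold
    ]


print(escalation_targets(Impact.COMPLETE_OUTAGE, elapsed_minutes=10))
```

Keeping the policy as data rather than branching logic makes it easier to review alongside the SLA and to audit after an incident.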

Module 6: SLA Design and Contractual Integration

Negotiating SLA exclusions for planned maintenance, customer-caused outages, and force majeure events.
Defining precise measurement methodologies in SLAs to prevent disputes over reported uptime.
Aligning internal SLOs with external SLAs to ensure operational feasibility and accountability.
Structuring penalty clauses that reflect actual business impact without creating financial disincentives for transparency.
Specifying data sources and audit rights for third-party verification of SLA compliance.
Handling SLA aggregation across multiple services or components with interdependent availability.
Updating SLAs when service scope or architecture changes (e.g., migration to cloud, new dependencies).
Documenting SLA exceptions for beta, preview, or experimental features offered without guarantees.
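
For SLA aggregation, a useful first approximation treats component failures as independent: components that must all work multiply their availabilities, while redundant components fail only if every replica fails. The sketch below applies those two rules; the function names and the sample figures are illustrative, and real systems with correlated failures will do worse than this estimate.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """All components are required: multiply their availabilities."""
    return prod(availabilities)


def redundant_availability(availabilities: Iterable[float]) -> float:
    """Any one replica suffices: the service is down only if all are down."""
    return 1 - prod(1 - a for a in availabilities)


# A web tier, an API tier, and a database in series (hypothetical figures).
chain = serial_availability([0.9995, 0.9995, 0.9999])
print(f"serial chain: {chain:.4%}")          # lower than any single component

# Two independent replicas at 99% each.
pair = redundant_availability([0.99, 0.99])
print(f"redundant pair: {pair:.4%}")
```

The serial result shows why an end-to-end SLA cannot simply restate the figure of its strongest component.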

Module 7: Capacity Planning and Scalability for Availability

Forecasting traffic growth and provisioning capacity to prevent resource exhaustion outages.
Implementing auto-scaling policies that respond to real-time load while avoiding thrashing.
Conducting load testing to validate system behavior at peak and sustained capacity levels.
Reserving failover capacity in secondary regions without incurring unnecessary idle costs.
Managing database connection pool limits to prevent exhaustion during traffic spikes.
Planning for seasonal or event-driven load variations (e.g., fiscal closing, marketing campaigns).
Using capacity trend analysis to justify infrastructure investments that improve availability.
Coordinating capacity updates with change management to minimize deployment-related outages.
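
The thrashing problem mentioned for auto-scaling is often addressed with two safeguards: a gap between the scale-out and scale-in thresholds, and a cooldown after each change. The following is a simplified sketch of that idea; the class names, thresholds, and one-instance step size are assumptions and do not correspond to any cloud provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingPolicy:
    scale_out_above: float = 0.75   # utilization that triggers adding capacity
    scale_in_below: float = 0.40    # utilization that triggers removing capacity
    cooldown_s: float = 300.0       # minimum time between scaling actions
    min_instances: int = 2
    max_instances: int = 20


class Autoscaler:
    def __init__(self, policy: ScalingPolicy, instances: int):
        self.policy = policy
        self.instances = instances
        self._last_change_at: Optional[float] = None

    def observe(self, utilization: float, now_s: float) -> int:
        """Apply one utilization reading and return the instance count."""
        p = self.policy
        in_cooldown = (
            self._last_change_at is not None
            and now_s - self._last_change_at < p.cooldown_s
        )
        if in_cooldown:
            return self.instances
        if utilization > p.scale_out_above and self.instances < p.max_instances:
            self.instances += 1
        elif utilization < p.scale_in_below and self.instances > p.min_instances:
            self.instances -= 1
        else:
            return self.instances   # inside the dead band or at a limit
        self._last_change_at = now_s
        return self.instances


scaler = Autoscaler(ScalingPolicy(), instances=4)
scaler.observe(0.90, now_s=0)     # scales out to 5
scaler.observe(0.20, now_s=60)    # ignored: still in cooldown
print(scaler.instances)
```

The dead band between 0.40 and 0.75 prevents a reading that hovers near a single threshold from triggering alternating scale-out and scale-in actions.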

Module 8: Governance, Reporting, and Continuous Improvement

Producing monthly availability reports with breakdowns by service, region, and incident category.
Presenting availability data to executive stakeholders using business-aligned KPIs rather than raw technical metrics.
Conducting SLA compliance audits to verify accuracy of reported uptime and incident records.
Updating availability targets based on business evolution, technology refresh, or risk appetite changes.
Integrating availability performance into vendor scorecards for third-party service providers.
Standardizing incident classification and reporting across teams to ensure data consistency.
Using availability trends to prioritize reliability engineering initiatives in roadmap planning.
Enforcing change advisory board (CAB) reviews for modifications that could impact availability targets.
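
The monthly breakdowns described in this module amount to grouping incident records by a chosen dimension. A minimal sketch follows, assuming each incident is a dictionary with `service`, `region`, `category`, and `downtime_minutes` keys; this record layout and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def downtime_by(incidents: Iterable[Mapping], dimension: str) -> Dict[str, float]:
    """Total downtime minutes per value of the dimension, largest first."""
    totals: Dict[str, float] = defaultdict(float)
    for incident in incidents:
        totals[incident[dimension]] += incident["downtime_minutes"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Illustrative records only; not real incident data.
incidents = [
    {"service": "checkout", "region": "eu-west", "category": "deployment", "downtime_minutes": 12},
    {"service": "checkout", "region": "us-east", "category": "capacity", "downtime_minutes": 30},
    {"service": "search", "region": "eu-west", "category": "deployment", "downtime_minutes": 8},
]

for dimension in ("service", "region", "category"):
    print(dimension, downtime_by(incidents, dimension))
```

This only works if incident classification is standardized across teams, which is why consistent categories matter before any trend analysis.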

Module 9: Regulatory and Compliance Considerations

Mapping availability requirements to regulatory mandates (e.g., financial transaction systems, healthcare platforms).
Designing audit trails that demonstrate continuous compliance with availability obligations.
Handling jurisdiction-specific data residency rules when deploying redundant systems.
Documenting business continuity and disaster recovery plans for regulatory inspections.
Ensuring availability logging meets retention periods required by industry standards (e.g., PCI-DSS, HIPAA).
Coordinating with legal teams to assess liability exposure from unmet availability commitments.
Implementing role-based access controls for availability reporting to meet segregation of duties requirements.
Validating that third-party providers comply with contractual and regulatory availability obligations.