This curriculum covers the design and operationalization of availability metrics across nine technical and organizational domains. Its scope is comparable to a multi-phase internal capability program for establishing enterprise-wide SRE practices.
Module 1: Defining Availability in Business Contexts
- Select whether to measure availability based on user session success rate or system uptime, considering customer-facing SLAs.
- Determine the scope of "product" in availability metrics: whether to include backend services, APIs, or only end-user interfaces.
- Decide whether availability calculations should exclude scheduled maintenance windows and how to document such exclusions.
- Establish thresholds for degraded performance versus complete unavailability in telemetry systems.
- Align availability definitions across product, SRE, and customer support teams to prevent metric misinterpretation.
- Implement tagging logic in monitoring tools to distinguish between partial outages and full service failures.
- Negotiate with legal and compliance teams on how availability data may be used in contractual reporting.
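
The two measurement choices in this module (time-based uptime versus request or session success rate, with or without maintenance exclusions) can give noticeably different numbers for the same period. The following is a minimal sketch of both calculations. The `Window` type, the function names, and the sample figures are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Half-open interval [start, end), in seconds from the start of the period."""
    start: float
    end: float

    @property
    def length(self) -> float:
        return max(0.0, self.end - self.start)


def intersect(a: Window, b: Window) -> Window:
    """Intersection of two windows; has zero length if they do not overlap."""
    return Window(max(a.start, b.start), min(a.end, b.end))


def time_based_availability(period: Window, outages: list[Window],
                            maintenance: list[Window],
                            exclude_maintenance: bool = True) -> float:
    """Uptime fraction over `period`.

    Assumes outage windows do not overlap each other, and that
    maintenance windows do not overlap each other.
    """
    measured = period.length
    downtime = 0.0
    for outage in outages:
        clipped = intersect(outage, period)
        downtime += clipped.length
        if exclude_maintenance:
            # Downtime inside a scheduled window is not counted against the service.
            downtime -= sum(intersect(clipped, m).length for m in maintenance)
    if exclude_maintenance:
        # Scheduled windows are also removed from the measured time.
        measured -= sum(intersect(period, m).length for m in maintenance)
    if measured <= 0:
        raise ValueError("no measured time left in the period")
    return 1.0 - downtime / measured


def request_based_availability(successful: int, total: int) -> float:
    """Fraction of requests (or user sessions) that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


# Hypothetical period of 1000 s with one maintenance window and two outages.
period = Window(0, 1000)
maintenance = [Window(100, 200)]
outages = [Window(150, 250), Window(500, 550)]

with_exclusion = time_based_availability(period, outages, maintenance)
without_exclusion = time_based_availability(period, outages, maintenance,
                                            exclude_maintenance=False)
```

In this example the exclusion policy alone moves the result from 0.85 to roughly 0.889, which is why the exclusion rule needs to be documented alongside the metric.
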
Module 2: Instrumentation and Data Collection Architecture
- Choose between agent-based and agentless monitoring for availability tracking across hybrid environments.
- Configure synthetic transaction checks to simulate real user workflows, including multi-step authentication and checkout.
- Deploy heartbeat mechanisms with configurable intervals to balance network load and detection speed.
- Integrate third-party API status into internal availability dashboards using standardized polling intervals.
- Design data pipelines to aggregate availability signals from edge locations, regional data centers, and cloud providers.
- Implement data sampling strategies to reduce storage costs while preserving statistical accuracy for outage analysis.
- Validate timestamp synchronization across distributed systems to ensure accurate incident correlation.
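
The heartbeat item above involves a trade-off between network load and detection speed. As a rough sketch, assuming a service is flagged once a fixed number of consecutive expected heartbeats is missing, the relationship can be expressed as follows. The function names and the default of three missed beats are hypothetical.

```python
import math


def missed_beats(last_seen: float, now: float, interval: float) -> int:
    """Number of expected heartbeats that have not arrived since `last_seen`."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return max(0, math.floor((now - last_seen) / interval))


def is_down(last_seen: float, now: float, interval: float,
            max_missed: int = 3) -> bool:
    """Flag the service once `max_missed` consecutive beats are missing."""
    return missed_beats(last_seen, now, interval) >= max_missed


def tradeoff(interval: float, max_missed: int = 3) -> dict:
    """Load versus detection delay for a given heartbeat interval (seconds)."""
    return {
        "beats_per_hour": 3600 / interval,
        # Seconds after the last received heartbeat before the service is flagged.
        "detection_delay_s": interval * max_missed,
    }


fast = tradeoff(10)   # more traffic, quicker detection
slow = tradeoff(60)   # less traffic, slower detection
```

A 10-second interval sends six times as many heartbeats as a 60-second interval, and in exchange flags a silent service after 30 seconds instead of 180.
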
Module 3: Calculating and Normalizing KPIs
- Select between rolling 28-day and calendar-month availability calculations based on billing cycle alignment.
- Apply weighted averaging for multi-region services where user traffic distribution affects overall availability impact.
- Adjust raw uptime percentages to account for known false positives in health check systems.
- Normalize KPIs across services with different deployment frequencies to enable cross-team benchmarking.
- Define exclusion criteria for external dependencies (e.g., CDN failures) in internal availability reporting.
- Implement correction factors for services with asynchronous components where "availability" includes job processing latency.
- Automate recalibration of baseline KPIs following infrastructure migrations or architectural changes.
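
Traffic-weighted averaging for multi-region services can be sketched as below. The region names, availability figures, and traffic shares are made-up illustration values, and the weights are assumed to be proportional to user traffic.

```python
def weighted_availability(regions: dict[str, tuple[float, float]]) -> float:
    """Traffic-weighted availability.

    `regions` maps a region name to (availability fraction, traffic weight).
    Weights do not need to sum to 1; they are normalized here.
    """
    total_weight = sum(weight for _, weight in regions.values())
    if total_weight <= 0:
        raise ValueError("total traffic weight must be positive")
    return sum(avail * weight for avail, weight in regions.values()) / total_weight


def unweighted_availability(regions: dict[str, tuple[float, float]]) -> float:
    """Plain mean across regions, ignoring traffic distribution."""
    return sum(avail for avail, _ in regions.values()) / len(regions)


# Hypothetical regions: (availability, share of user traffic in percent).
regions = {
    "us-east": (0.999, 70),
    "eu-west": (0.990, 20),
    "ap-south": (1.000, 10),
}

weighted = weighted_availability(regions)
unweighted = unweighted_availability(regions)
```

Here the weighted figure (about 0.9973) is higher than the plain mean (about 0.9963) because the worst-performing region carries only a fifth of the traffic.
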
Module 4: Thresholds, SLOs, and Error Budgets
- Choose an SLO target (for example, 99.9% versus 99.95%) for each product tier based on historical performance and business risk tolerance.
- Define error budget consumption rules across deployment pipelines, specifying when teams may spend budget during maintenance windows.
- Configure dynamic thresholds that adjust for expected traffic surges during promotional events.
- Define escalation paths when error budget depletion exceeds 80% in a billing cycle.
- Implement circuit breaker logic that halts deployments when availability dips below a team-defined threshold.
- Coordinate SLO resets after major version releases, acknowledging temporary instability periods.
- Document exceptions for planned feature flag rollouts that may trigger false availability degradation signals.
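
The arithmetic behind an error budget is short enough to show directly. The sketch below assumes a time-based SLO measured over a 30-day period. The 80% escalation point comes from the list above, while the function names and the "halt at 100%" rule are illustrative assumptions rather than a prescribed policy.

```python
PERIOD_MINUTES = 30 * 24 * 60  # 30-day period = 43,200 minutes


def error_budget_minutes(slo: float, period_minutes: float = PERIOD_MINUTES) -> float:
    """Allowed downtime for an SLO expressed as a fraction (e.g. 0.999)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * period_minutes


def budget_consumed(slo: float, downtime_minutes: float,
                    period_minutes: float = PERIOD_MINUTES) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return downtime_minutes / error_budget_minutes(slo, period_minutes)


def deployment_decision(consumed: float, escalate_at: float = 0.8) -> str:
    """Hypothetical policy: escalate above 80% consumed, halt once exhausted."""
    if consumed >= 1.0:
        return "halt"
    if consumed > escalate_at:
        return "escalate"
    return "proceed"


budget_999 = error_budget_minutes(0.999)    # about 43.2 minutes
budget_9995 = error_budget_minutes(0.9995)  # about 21.6 minutes

decision_a = deployment_decision(budget_consumed(0.999, 36))   # 36 / 43.2
decision_b = deployment_decision(budget_consumed(0.9995, 36))  # 36 / 21.6
decision_c = deployment_decision(budget_consumed(0.999, 10))   # 10 / 43.2
```

Moving from 99.9% to 99.95% halves the budget, so the same 36 minutes of downtime that triggers an escalation under the looser target exhausts the budget under the tighter one.
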
Module 5: Alerting and Incident Response Integration
- Map availability thresholds to alert severity levels (e.g., warning at 99.0%, critical at 98.0%) in monitoring systems.
- Configure deduplication rules to prevent alert storms during cascading failures affecting multiple dependent services.
- Integrate availability alerts with incident management platforms to auto-populate incident timelines.
- Define conditions under which degraded availability triggers automatic failover to backup regions.
- Establish on-call rotation rules based on availability SLA criticality and historical incident frequency.
- Implement alert suppression windows during pre-approved maintenance activities with audit logging.
- Validate that alert notifications include contextual data such as recent deployment history and dependency status.
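
Severity mapping and maintenance suppression can be combined in a few lines. This sketch uses the example thresholds from the list above (warning at 99.0%, critical at 98.0%) and assumes an alert fires when measured availability falls below a threshold. The function names and the window representation are hypothetical.

```python
from datetime import datetime, timezone


def severity(availability_pct: float, warning_pct: float = 99.0,
             critical_pct: float = 98.0) -> str:
    """Map a measured availability percentage to an alert severity."""
    if availability_pct < critical_pct:
        return "critical"
    if availability_pct < warning_pct:
        return "warning"
    return "ok"


def in_maintenance(now: datetime,
                   windows: list[tuple[datetime, datetime]]) -> bool:
    """True if `now` falls inside any pre-approved [start, end) window."""
    return any(start <= now < end for start, end in windows)


def should_notify(availability_pct: float, now: datetime,
                  windows: list[tuple[datetime, datetime]]) -> bool:
    """Notify only for non-ok severities outside maintenance windows."""
    return severity(availability_pct) != "ok" and not in_maintenance(now, windows)


# Hypothetical maintenance window: 02:00 to 04:00 UTC on 1 March 2024.
windows = [(
    datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 3, 1, 4, 0, tzinfo=timezone.utc),
)]
during = datetime(2024, 3, 1, 3, 0, tzinfo=timezone.utc)
after = datetime(2024, 3, 1, 5, 0, tzinfo=timezone.utc)
```

A real suppression mechanism would also write an audit record each time an alert is suppressed; that part is omitted here.
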
Module 6: Data Storage and Retention Policies
- Select time-series database retention periods based on regulatory requirements and trend analysis needs.
- Partition availability data by service, region, and customer segment to optimize query performance for audits.
- Implement data tiering strategies that move older availability records to lower-cost storage after 90 days.
- Define access controls for availability data to restrict sensitive performance information to authorized roles.
- Design export mechanisms to generate availability reports in formats required by external auditors.
- Enforce encryption at rest and in transit for all availability telemetry, including backup snapshots.
- Validate backup integrity for availability databases through periodic restore drills.
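
The 90-day tiering rule reduces to an age check on each record. A minimal sketch follows, assuming timezone-aware timestamps and two hypothetical tier names, "hot" and "cold".

```python
from datetime import datetime, timedelta, timezone


def storage_tier(recorded_at: datetime, now: datetime, hot_days: int = 90) -> str:
    """Return "cold" for records older than `hot_days`, otherwise "hot"."""
    return "cold" if now - recorded_at > timedelta(days=hot_days) else "hot"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
recent = datetime(2024, 5, 1, tzinfo=timezone.utc)  # 31 days old
old = datetime(2024, 1, 1, tzinfo=timezone.utc)     # 152 days old

tiers = [storage_tier(recent, now), storage_tier(old, now)]
```

In practice the move itself is usually delegated to the storage system's lifecycle policy; the function above only illustrates the selection rule.
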
Module 7: Cross-Functional Reporting and Accountability
- Produce executive-level availability dashboards that aggregate data across product lines without exposing operational details.
- Reconcile discrepancies between finance-reported uptime (for billing) and engineering-reported availability.
- Assign ownership tags to availability metrics to hold teams accountable for SLO breaches.
- Integrate availability KPIs into quarterly business reviews with product and engineering leadership.
- Develop root cause classification taxonomies to identify recurring failure patterns across teams.
- Implement blameless postmortem tracking to correlate availability incidents with process improvements.
- Coordinate with legal to ensure availability reporting complies with service contract disclosure requirements.
Module 8: Continuous Improvement and Benchmarking
- Conduct quarterly reviews of availability KPIs to identify services requiring architectural investment.
- Compare internal availability benchmarks against industry standards for similar service types.
- Update monitoring configurations based on lessons learned from major incident retrospectives.
- Refine synthetic transaction scripts to reflect evolving user behavior and feature usage.
- Adjust SLOs upward after sustained periods of overperformance, subject to business approval.
- Implement automated anomaly detection to identify subtle availability degradation before threshold breaches.
- Integrate availability trends into capacity planning models to prevent resource exhaustion outages.
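
One simple way to catch subtle degradation before a fixed threshold is breached is a z-score test against recent history. This is only one possible technique, not necessarily the one a given monitoring product uses. The function name, the threshold of 3 standard deviations, and the sample readings are assumptions.

```python
from statistics import mean, stdev


def is_degraded(history: list[float], latest: float,
                z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `z_threshold` sample standard
    deviations below the mean of `history`. Only drops are flagged."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any drop at all is unusual.
        return latest < mu
    return (mu - latest) / sigma > z_threshold


# Hypothetical daily availability percentages for one service.
history = [99.95, 99.96, 99.94, 99.95, 99.97, 99.95]

flag_drop = is_degraded(history, 99.80)    # well outside normal variation
flag_normal = is_degraded(history, 99.94)  # within normal variation
```

A reading of 99.80% would not trip a 99.0% warning threshold, yet it is far outside this service's usual range, which is the kind of early signal the anomaly detection item is after.
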
Module 9: Regulatory and Audit Compliance
- Document data sources and calculation methodologies for availability metrics to support external audits.
- Preserve raw availability logs for the minimum duration required by financial or healthcare regulations.
- Implement role-based access logging for all queries and modifications to availability data stores.
- Validate that third-party monitoring providers comply with data sovereignty requirements in each operating region.
- Prepare availability reports that meet the evidence requirements of audit frameworks (e.g., SOC 2, ISO 27001) upon auditor request.
- Conduct internal mock audits to test the completeness and accuracy of availability data retrieval.
- Update compliance documentation whenever monitoring tools or KPI definitions are modified.