This curriculum covers the full lifecycle of service outage management, from definition and detection through reporting and governance. Its scope is comparable to a multi-workshop operational readiness program for enterprise SRE and service management teams.
Module 1: Defining and Classifying Service Outages
- Selecting outage classification criteria based on business impact, duration, and affected components rather than technical root cause alone.
- Implementing a standardized taxonomy for outages (e.g., partial, complete, regional, cascading) to ensure consistent incident reporting across teams.
- Deciding whether to include planned maintenance windows as outages in SLA calculations, considering customer expectations and operational realities.
- Establishing thresholds for what constitutes a reportable outage, balancing signal against noise in monitoring systems.
- Aligning outage definitions with legal and contractual obligations, particularly in regulated industries where reporting triggers are mandatory.
- Integrating outage classification into incident management workflows to ensure accurate tagging during real-time response.
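
As an illustration of the taxonomy and reportable-outage threshold discussed in this module, the sketch below models an outage record and a reportability rule. The field names, the `is_reportable` helper, and the default thresholds (5 minutes, 1% of users) are hypothetical placeholders rather than recommended values; each organization would set its own.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class OutageType(Enum):
    """Taxonomy from the module text: partial, complete, regional, cascading."""
    PARTIAL = "partial"
    COMPLETE = "complete"
    REGIONAL = "regional"
    CASCADING = "cascading"


@dataclass(frozen=True)
class OutageRecord:
    service: str
    outage_type: OutageType
    duration: timedelta
    affected_users_pct: float  # share of users impacted, 0-100
    planned: bool = False      # True for planned maintenance windows


def is_reportable(
    record: OutageRecord,
    min_duration: timedelta = timedelta(minutes=5),  # assumed threshold
    min_affected_pct: float = 1.0,                   # assumed threshold
    count_planned: bool = False,                     # policy decision, see module text
) -> bool:
    """Return True if the outage crosses the (hypothetical) reporting thresholds."""
    if record.planned and not count_planned:
        return False
    return (
        record.duration >= min_duration
        and record.affected_users_pct >= min_affected_pct
    )


if __name__ == "__main__":
    rec = OutageRecord("checkout", OutageType.PARTIAL, timedelta(minutes=12), 8.0)
    print(rec.outage_type.value, is_reportable(rec))
```

Keeping the planned-maintenance decision as an explicit parameter makes the policy choice visible in one place instead of burying it in downstream SLA calculations.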
Module 2: Measuring Availability and Downtime
- Calculating availability using precise timestamps from monitoring systems, excluding false positives caused by probe failures.
- Choosing between uptime percentage and downtime minutes as the primary metric based on service criticality and customer reporting needs.
- Handling edge cases such as intermittent outages lasting only seconds, and determining whether they count toward SLA breach calculations.
- Implementing time-zone-aware outage tracking for globally distributed services to avoid misalignment in reporting windows.
- Reconciling discrepancies between synthetic monitoring data and real user monitoring (RUM) when calculating effective downtime.
- Designing automated data pipelines to aggregate outage duration from multiple sources without double-counting overlapping incidents.
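
The aggregation and double-counting concerns above can be sketched as an interval merge followed by an availability calculation. This is a minimal example, assuming each incident is a `(start, end)` pair of timezone-aware UTC datetimes; the function names are illustrative and not taken from any particular monitoring tool.

```python
from datetime import datetime, timedelta, timezone


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so shared time is counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_downtime(intervals, window_start, window_end):
    """Sum downtime inside the reporting window, without double-counting."""
    clipped = [
        (max(start, window_start), min(end, window_end))
        for start, end in intervals
        if start < window_end and end > window_start
    ]
    return sum((end - start for start, end in merge_intervals(clipped)), timedelta())


def availability_pct(intervals, window_start, window_end):
    """Availability over the window as a percentage."""
    down = total_downtime(intervals, window_start, window_end)
    return 100.0 * (1 - down / (window_end - window_start))


if __name__ == "__main__":
    utc = timezone.utc
    window = (datetime(2024, 6, 1, tzinfo=utc), datetime(2024, 7, 1, tzinfo=utc))
    incidents = [
        (datetime(2024, 6, 5, 10, 0, tzinfo=utc), datetime(2024, 6, 5, 10, 30, tzinfo=utc)),
        (datetime(2024, 6, 5, 10, 20, tzinfo=utc), datetime(2024, 6, 5, 11, 0, tzinfo=utc)),
        (datetime(2024, 6, 20, 14, 0, tzinfo=utc), datetime(2024, 6, 20, 14, 15, tzinfo=utc)),
    ]
    print(total_downtime(incidents, *window))               # 1:15:00
    print(round(availability_pct(incidents, *window), 4))   # 99.8264
```

The two incidents on 5 June overlap by ten minutes, so they contribute 60 minutes rather than 70. Storing all timestamps in UTC and converting only for display is one way to keep reporting windows aligned across regions.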
Module 3: SLA Design and Outage Inclusion Criteria
- Deciding which services and components are in scope for SLAs, especially when dependencies on third-party providers exist.
- Setting exclusion rules for outages caused by customer misconfiguration or unsupported usage patterns.
- Defining whether upstream provider outages (e.g., cloud infrastructure) are attributable to the service provider’s SLA commitments.
- Structuring SLAs with tiered availability targets based on service criticality (e.g., 99.9% vs. 99.99%).
- Documenting and versioning SLA terms to ensure auditability and consistency during dispute resolution.
- Aligning SLA measurement intervals (e.g., monthly, quarterly) with billing cycles and customer review cadences.
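
To make the tiered targets concrete, an availability percentage can be converted into an allowed-downtime budget for a measurement interval. The sketch below assumes a 30-day month as the interval; `downtime_budget` and `sla_breached` are illustrative names, not part of any standard library.

```python
from datetime import timedelta


def downtime_budget(target_pct: float, window: timedelta) -> timedelta:
    """Maximum downtime allowed in `window` for an availability target in percent."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return window * (1 - target_pct / 100)


def sla_breached(observed_downtime: timedelta, target_pct: float, window: timedelta) -> bool:
    """True if observed downtime exceeds the budget implied by the target."""
    return observed_downtime > downtime_budget(target_pct, window)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed measurement interval
    for target in (99.9, 99.99):
        minutes = downtime_budget(target, month).total_seconds() / 60
        print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Over 30 days, 99.9% allows roughly 43.2 minutes of downtime and 99.99% roughly 4.32 minutes, which is why moving a service up one tier is usually an architectural decision rather than a tuning exercise.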
Module 4: Incident Detection and Outage Verification
- Configuring alerting thresholds to minimize false positives while ensuring timely detection of actual service degradation.
- Implementing cross-system correlation to distinguish isolated failures from broader service outages.
- Validating outage status using multiple monitoring sources before initiating SLA tracking procedures.
- Designing escalation paths that trigger based on outage duration and impact level, not just alert volume.
- Integrating status page updates with outage verification workflows to prevent premature public disclosure.
- Using automated health checks to confirm service restoration before closing outage records.
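
The verification and restoration steps above can be sketched as two small checks: a quorum across monitoring sources before an outage is confirmed, and a run of consecutive healthy checks before it is closed. The source names, the quorum of 2, and the requirement of 3 consecutive passes are assumptions chosen for illustration.

```python
def outage_confirmed(source_reports_down: dict[str, bool], quorum: int = 2) -> bool:
    """Confirm an outage only when at least `quorum` independent sources report it.

    `source_reports_down` maps a monitoring source name to True if that
    source currently sees the service as down.
    """
    return sum(source_reports_down.values()) >= quorum


def restoration_confirmed(health_checks: list[bool], required_consecutive: int = 3) -> bool:
    """Confirm restoration only after the most recent N health checks all passed.

    `health_checks` is ordered oldest to newest; True means the check passed.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(health_checks) < required_consecutive:
        return False
    return all(health_checks[-required_consecutive:])


if __name__ == "__main__":
    reports = {"synthetic_probe": True, "rum": True, "infra_metrics": False}
    print(outage_confirmed(reports))                          # True
    print(restoration_confirmed([False, True, True, True]))   # True
```

Requiring agreement between sources before SLA tracking starts filters out single-probe failures, at the cost of a slightly later confirmed start time. How that delay is treated in the downtime record is a policy choice.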
Module 5: Root Cause Analysis and Post-Outage Review
- Conducting blameless post-mortems that focus on systemic issues rather than individual accountability.
- Standardizing root cause categorization (e.g., deployment error, configuration drift, capacity exhaustion) for trend analysis.
- Deciding which outages require full RCA based on business impact, recurrence, or customer escalation.
- Integrating RCA findings into change management processes to prevent recurrence of similar outages.
- Archiving RCA reports with metadata to support audits and regulatory compliance requirements.
- Sharing RCA summaries with stakeholders while redacting sensitive operational details.
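
Standardized root cause categories make trend analysis a simple counting exercise. The sketch below assumes each RCA record is a dictionary with hypothetical `quarter` and `category` keys; the category names mirror the examples given in this module, and anything outside the agreed set is grouped as `uncategorized`.

```python
from collections import Counter

# Categories taken from the examples in the module text.
ROOT_CAUSE_CATEGORIES = {
    "deployment_error",
    "configuration_drift",
    "capacity_exhaustion",
}


def root_cause_trend(rca_records):
    """Count outages per (quarter, category) for trend analysis."""
    counts = Counter()
    for record in rca_records:
        category = record["category"]
        if category not in ROOT_CAUSE_CATEGORIES:
            category = "uncategorized"
        counts[(record["quarter"], category)] += 1
    return counts


if __name__ == "__main__":
    records = [
        {"quarter": "2024-Q1", "category": "deployment_error"},
        {"quarter": "2024-Q1", "category": "deployment_error"},
        {"quarter": "2024-Q1", "category": "capacity_exhaustion"},
        {"quarter": "2024-Q2", "category": "dns"},  # not in the agreed set
    ]
    for key, count in sorted(root_cause_trend(records).items()):
        print(key, count)
```

A growing `uncategorized` count is itself a useful signal that the category list needs revisiting.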
Module 6: Reporting, Transparency, and Customer Communication
- Generating SLA compliance reports that differentiate between planned and unplanned outages for customer review.
- Designing public status dashboards with real-time outage indicators while controlling disclosure of sensitive details.
- Deciding when to proactively notify customers of outages based on severity, duration, and contractual obligations.
- Standardizing communication templates for outage updates to ensure consistency across support and engineering teams.
- Handling discrepancies between internal outage records and customer-reported downtime claims.
- Archiving all customer communications related to outages for legal and compliance purposes.
Module 7: Continuous Improvement and SLA Governance
- Reviewing SLA performance trends quarterly to identify services requiring architectural investment or deprecation.
- Adjusting SLA targets based on evolving business requirements, not just historical performance.
- Enforcing change advisory board (CAB) reviews for modifications to SLA-critical systems.
- Implementing automated compliance checks to ensure new services meet baseline availability standards before launch.
- Tracking SLA breach frequency across vendors and internal teams to inform sourcing and accountability decisions.
- Conducting annual SLA governance audits to verify data accuracy, process adherence, and policy alignment.