Description

This curriculum spans the full lifecycle of service outage management, from definition and detection through reporting and governance. Its scope is comparable to a multi-workshop operational readiness program for enterprise SRE and service management teams.

Module 1: Defining and Classifying Service Outages

Selecting outage classification criteria based on business impact, duration, and affected components rather than technical root cause alone.
Implementing a standardized taxonomy for outages (e.g., partial, complete, regional, cascading) to ensure consistent incident reporting across teams.
Deciding whether to include planned maintenance windows as outages in SLA calculations, considering customer expectations and operational realities.
Establishing thresholds for what constitutes a reportable outage, balancing signal against noise in monitoring systems.
Aligning outage definitions with legal and contractual obligations, particularly in regulated industries where reporting triggers are mandatory.
Integrating outage classification into incident management workflows to ensure accurate tagging during real-time response.
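
As an illustration of the taxonomy above, the following sketch maps a single outage observation onto the four example categories (partial, complete, regional, cascading). The field names and the rule order are assumptions made for this example, not a prescribed standard; a real taxonomy would be agreed across teams.

```python
from dataclasses import dataclass
from enum import Enum


class OutageType(Enum):
    PARTIAL = "partial"
    COMPLETE = "complete"
    REGIONAL = "regional"
    CASCADING = "cascading"


@dataclass(frozen=True)
class OutageObservation:
    """Hypothetical snapshot of impact taken during incident response."""
    affected_services: int      # services currently failing
    total_services: int         # services in scope
    affected_regions: int       # regions with failures
    total_regions: int          # regions the service runs in
    spread_to_dependents: bool  # failure propagated to downstream services


def classify(obs: OutageObservation) -> OutageType:
    """Map an observation onto the taxonomy; the rule order is an assumption."""
    if obs.spread_to_dependents:
        return OutageType.CASCADING
    if (obs.affected_services == obs.total_services
            and obs.affected_regions == obs.total_regions):
        return OutageType.COMPLETE
    if 0 < obs.affected_regions < obs.total_regions:
        return OutageType.REGIONAL
    return OutageType.PARTIAL


# Example: 2 of 10 services failing in 1 of 3 regions, no downstream spread.
tag = classify(OutageObservation(2, 10, 1, 3, False))
```

Keeping the rules in one function means the same tag is applied whether the incident is opened by an on-call engineer or by automation.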

Module 2: Measuring Availability and Downtime

Calculating availability using precise timestamps from monitoring systems, excluding false positives caused by probe failures.
Choosing between uptime percentage and downtime minutes as the primary metric based on service criticality and customer reporting needs.
Handling edge cases such as intermittent outages lasting only seconds, and determining whether they count toward SLA breach calculations.
Implementing time-zone-aware outage tracking for globally distributed services to avoid misalignment in reporting windows.
Reconciling discrepancies between synthetic monitoring data and real user monitoring (RUM) when calculating effective downtime.
Designing automated data pipelines to aggregate outage duration from multiple sources without double-counting overlapping incidents.
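
A minimal sketch of the aggregation step, assuming outages arrive as (start, end) pairs of time-zone-aware UTC datetimes from several sources. Overlapping intervals are merged first so shared minutes are counted once, then clipped to the reporting window. The function names are illustrative.

```python
from datetime import datetime, timezone


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs so shared time counts once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def availability_percent(intervals, window_start, window_end):
    """Availability over a reporting window, clipping outages to the window."""
    window = (window_end - window_start).total_seconds()
    downtime = 0.0
    for start, end in merge_intervals(intervals):
        start, end = max(start, window_start), min(end, window_end)
        if end > start:
            downtime += (end - start).total_seconds()
    return 100.0 * (window - downtime) / window


# Two sources report the same incident with slightly different bounds.
utc = timezone.utc
synthetic = (datetime(2024, 1, 5, 10, 0, tzinfo=utc), datetime(2024, 1, 5, 10, 30, tzinfo=utc))
rum = (datetime(2024, 1, 5, 10, 20, tzinfo=utc), datetime(2024, 1, 5, 10, 50, tzinfo=utc))
pct = availability_percent(
    [synthetic, rum],
    datetime(2024, 1, 1, tzinfo=utc),
    datetime(2024, 1, 31, tzinfo=utc),
)
```

Storing every timestamp in UTC and converting only at the reporting layer avoids reporting-window misalignment between regions.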

Module 3: SLA Design and Outage Inclusion Criteria

Deciding which services and components are in scope for SLAs, especially when dependencies on third-party providers exist.
Setting exclusion rules for outages caused by customer misconfiguration or unsupported usage patterns.
Defining whether upstream provider outages (e.g., cloud infrastructure) count against the service provider’s own SLA commitments.
Structuring SLAs with tiered availability targets based on service criticality (e.g., 99.9% vs. 99.99%).
Documenting and versioning SLA terms to ensure auditability and consistency during dispute resolution.
Aligning SLA measurement intervals (e.g., monthly, quarterly) with billing cycles and customer review cadences.
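
To make tiered targets concrete, the sketch below converts an availability target into an allowed-downtime budget for a measurement interval. The 30-day default period is an assumption; a contract may define the interval differently.

```python
def downtime_budget_minutes(target_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period while still meeting the target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


def sla_breached(downtime_minutes: float, target_percent: float, period_days: int = 30) -> bool:
    """True when measured downtime exceeds the budget for the period."""
    return downtime_minutes > downtime_budget_minutes(target_percent, period_days)


for target in (99.9, 99.99):
    print(f"{target}% over 30 days allows {downtime_budget_minutes(target):.2f} minutes")
```

Over a 30-day interval, a 99.9% target allows roughly 43.2 minutes of downtime and a 99.99% target roughly 4.3 minutes, which is why the higher tier usually demands a different architecture rather than just faster response.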

Module 4: Incident Detection and Outage Verification

Configuring alerting thresholds to minimize false positives while ensuring timely detection of actual service degradation.
Implementing cross-system correlation to distinguish isolated failures from broader service outages.
Validating outage status using multiple monitoring sources before initiating SLA tracking procedures.
Designing escalation paths that trigger based on outage duration and impact level, not just alert volume.
Integrating status page updates with outage verification workflows to prevent premature public disclosure.
Using automated health checks to confirm service restoration before closing outage records.
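
The verification and restoration steps above can be sketched as two small checks. The quorum of two sources and the three consecutive healthy checks are assumed values for illustration, and the source names are hypothetical.

```python
from typing import Mapping, Sequence


def outage_confirmed(source_reports_down: Mapping[str, bool], quorum: int = 2) -> bool:
    """True when at least `quorum` independent sources report the service down."""
    return sum(1 for down in source_reports_down.values() if down) >= quorum


def restoration_confirmed(recent_checks: Sequence[bool], required_consecutive: int = 3) -> bool:
    """True when the most recent `required_consecutive` health checks all passed."""
    if len(recent_checks) < required_consecutive:
        return False
    return all(recent_checks[-required_consecutive:])


# Start SLA tracking only once two of three sources agree.
start_tracking = outage_confirmed({"synthetic": True, "rum": True, "lb_health": False})

# Close the outage record only after three healthy checks in a row.
can_close = restoration_confirmed([False, True, True, True])
```

Requiring agreement between sources guards against a single failing probe opening an outage record, and requiring consecutive passes guards against closing it during a flapping recovery.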

Module 5: Root Cause Analysis and Post-Outage Review

Conducting blameless post-mortems that focus on systemic issues rather than individual accountability.
Standardizing root cause categorization (e.g., deployment error, configuration drift, capacity exhaustion) for trend analysis.
Deciding which outages require full RCA based on business impact, recurrence, or customer escalation.
Integrating RCA findings into change management processes to prevent recurrence of similar outages.
Archiving RCA reports with metadata to support audits and regulatory compliance requirements.
Sharing RCA summaries with stakeholders while redacting sensitive operational details.
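
A brief sketch of how standardized root cause categories support trend analysis, and how a full-RCA decision rule might be encoded. The categories come from the examples above; the thresholds (30 minutes of impact, two recurrences) are assumptions chosen only for illustration.

```python
from collections import Counter
from enum import Enum


class RootCause(Enum):
    DEPLOYMENT_ERROR = "deployment_error"
    CONFIGURATION_DRIFT = "configuration_drift"
    CAPACITY_EXHAUSTION = "capacity_exhaustion"


def cause_trend(records):
    """Count outages per (quarter, root cause); records are (quarter, RootCause) pairs."""
    return Counter(records)


def needs_full_rca(impact_minutes: float, recurrences: int, customer_escalated: bool,
                   impact_threshold: float = 30, recurrence_threshold: int = 2) -> bool:
    """Full RCA when impact is high, the outage keeps recurring, or a customer escalated."""
    return (impact_minutes >= impact_threshold
            or recurrences >= recurrence_threshold
            or customer_escalated)


records = [
    ("2024-Q1", RootCause.DEPLOYMENT_ERROR),
    ("2024-Q1", RootCause.DEPLOYMENT_ERROR),
    ("2024-Q2", RootCause.CAPACITY_EXHAUSTION),
]
trend = cause_trend(records)
```

A fixed enumeration keeps free-text variants such as "bad deploy" and "release issue" from fragmenting the counts.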

Module 6: Reporting, Transparency, and Customer Communication

Generating SLA compliance reports that differentiate between planned and unplanned outages for customer review.
Designing public status dashboards with real-time outage indicators while controlling disclosure of sensitive details.
Deciding when to proactively notify customers of outages based on severity, duration, and contractual obligations.
Standardizing communication templates for outage updates to ensure consistency across support and engineering teams.
Handling discrepancies between internal outage records and customer-reported downtime claims.
Archiving all customer communications related to outages for legal and compliance purposes.

Module 7: Continuous Improvement and SLA Governance

Reviewing SLA performance trends quarterly to identify services requiring architectural investment or deprecation.
Adjusting SLA targets based on evolving business requirements, not just historical performance.
Enforcing change advisory board (CAB) reviews for modifications to SLA-critical systems.
Implementing automated compliance checks to ensure new services meet baseline availability standards before launch.
Tracking SLA breach frequency across vendors and internal teams to inform sourcing and accountability decisions.
Conducting annual SLA governance audits to verify data accuracy, process adherence, and policy alignment.
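
An automated pre-launch compliance check could take a form like the sketch below. The profile fields and the baseline values (99.9% measured availability, two redundant zones) are hypothetical placeholders for whatever standards a governance board actually sets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical pre-launch facts about a service."""
    name: str
    has_health_check: bool
    redundant_zones: int
    measured_availability: float  # percent, e.g. from a pre-launch soak test


def launch_gaps(profile: ServiceProfile,
                baseline_availability: float = 99.9,
                min_zones: int = 2) -> List[str]:
    """Return the list of unmet baseline requirements; empty means ready to launch."""
    gaps = []
    if not profile.has_health_check:
        gaps.append("missing health check")
    if profile.redundant_zones < min_zones:
        gaps.append(f"needs at least {min_zones} zones")
    if profile.measured_availability < baseline_availability:
        gaps.append(f"availability below {baseline_availability}%")
    return gaps


ready = launch_gaps(ServiceProfile("billing-api", True, 3, 99.95))
blocked = launch_gaps(ServiceProfile("report-export", False, 1, 99.5))
```

Returning the full list of gaps, rather than failing on the first one, gives the owning team a complete remediation list in a single run.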