Description

This curriculum covers the design and operation of incident management practices at the depth of a multi-workshop program. It addresses detection, response, and organizational learning activities comparable to those in enterprise advisory engagements focused on resilience and operational alignment.

Module 1: Defining and Classifying Service Downtime

Selecting criteria for distinguishing between planned maintenance, unplanned outages, and partial degradation in service availability.
Implementing a standardized downtime taxonomy aligned with business service tiers and SLA classifications.
Deciding whether to track downtime by system, service, or user impact to align with incident reporting requirements.
Integrating business context into downtime definitions, such as distinguishing between peak and off-peak outages.
Resolving conflicts between infrastructure teams (measuring uptime) and business units (measuring usability) in defining downtime.
Documenting exceptions for acceptable downtime windows, including scheduled maintenance and third-party dependencies.
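
As an illustration of the taxonomy topics above, the following is a minimal sketch of a downtime classifier. The field names, the 0.9 error-rate threshold, and the peak-hour range are hypothetical choices made for this example, not values prescribed by the curriculum.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class DowntimeClass(Enum):
    PLANNED_MAINTENANCE = "planned_maintenance"
    UNPLANNED_OUTAGE = "unplanned_outage"
    PARTIAL_DEGRADATION = "partial_degradation"


@dataclass(frozen=True)
class DowntimeEvent:
    service: str
    start: datetime
    end: datetime
    in_approved_window: bool  # event falls inside a documented maintenance window
    error_rate: float         # fraction of failed requests, 0.0 to 1.0


# Hypothetical threshold: at or above this failure rate the service counts as fully down.
OUTAGE_ERROR_RATE = 0.9
# Hypothetical peak hours (local time, start hour inclusive, end hour exclusive).
PEAK_HOURS = range(8, 18)


def classify(event: DowntimeEvent) -> DowntimeClass:
    """Assign one taxonomy class to an event; approved windows take precedence."""
    if event.in_approved_window:
        return DowntimeClass.PLANNED_MAINTENANCE
    if event.error_rate >= OUTAGE_ERROR_RATE:
        return DowntimeClass.UNPLANNED_OUTAGE
    return DowntimeClass.PARTIAL_DEGRADATION


def started_in_peak(event: DowntimeEvent) -> bool:
    """Add business context: did the event begin during peak hours?"""
    return event.start.hour in PEAK_HOURS
```

In practice the thresholds would likely differ per business service tier, so a real taxonomy might look them up from the service catalog instead of using module-level constants.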

Module 2: Monitoring and Detection Architecture

Designing synthetic transaction checks that distinguish functional downtime from network-level unavailability.
Configuring alert thresholds to avoid false positives while ensuring timely detection of partial outages.
Choosing between agent-based and agentless monitoring for critical services based on security and coverage trade-offs.
Implementing distributed monitoring probes to detect regional or location-specific outages.
Integrating business transaction monitoring with infrastructure telemetry to correlate technical failures with service impact.
Managing alert fatigue by applying dynamic noise suppression rules during known maintenance windows.
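
The threshold and suppression topics above can be sketched as a small decision function. This is a simplified example under stated assumptions: checks produce a pass/fail result, an alert requires a run of consecutive failures, and maintenance windows are known in advance. The function names and the default of three failures are hypothetical.

```python
from datetime import datetime
from typing import List, Sequence, Tuple

Window = Tuple[datetime, datetime]  # (start inclusive, end exclusive)


def in_maintenance(now: datetime, windows: Sequence[Window]) -> bool:
    """True if `now` falls inside any known maintenance window."""
    return any(start <= now < end for start, end in windows)


def should_alert(
    recent_results: List[bool],
    now: datetime,
    windows: Sequence[Window],
    failures_required: int = 3,
) -> bool:
    """Decide whether to page someone.

    recent_results holds check outcomes, oldest first; True means the check passed.
    Requiring several consecutive failures trades detection speed for fewer
    false positives caused by a single transient failure.
    """
    if in_maintenance(now, windows):
        return False  # suppress noise during planned work
    if len(recent_results) < failures_required:
        return False  # not enough data to decide yet
    return not any(recent_results[-failures_required:])
```

Raising `failures_required` reduces false positives but delays detection by roughly one check interval per additional failure, which is the trade-off the threshold topic refers to.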

Module 3: Incident Response and Escalation Protocols

Establishing clear ownership for initial triage based on service ownership maps during multi-system outages.
Activating war room procedures only when downtime exceeds predefined business impact thresholds.
Coordinating communication between NOC, DevOps, and application support teams during overlapping incident scopes.
Documenting real-time incident timelines to support post-mortem analysis and regulatory reporting.
Enforcing escalation paths when resolution stalls beyond SLA breach thresholds.
Managing external vendor involvement when third-party services contribute to downtime.
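
One way to make escalation paths enforceable is to express them as a time-based ladder. The sketch below assumes a hypothetical three-step ladder; the role names and time thresholds are illustrative and would in practice come from the relevant SLA.

```python
from datetime import timedelta
from typing import List, Tuple

# Hypothetical ladder, sorted by ascending elapsed time since the incident began.
ESCALATION_LADDER: List[Tuple[timedelta, str]] = [
    (timedelta(0), "service owner on-call"),
    (timedelta(minutes=30), "incident manager"),
    (timedelta(hours=2), "head of operations"),
]


def escalation_target(
    elapsed: timedelta,
    ladder: List[Tuple[timedelta, str]] = ESCALATION_LADDER,
) -> str:
    """Return the highest role whose threshold has been reached."""
    target = ladder[0][1]
    for threshold, role in ladder:
        if elapsed >= threshold:
            target = role
    return target
```

For example, `escalation_target(timedelta(minutes=45))` returns `"incident manager"` with the ladder above.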

Module 4: Root Cause Analysis and Post-Incident Review

Selecting among timeline-based, fishbone, and five whys methodologies based on incident complexity.
Ensuring participation from all relevant technical teams in blameless post-mortems without delaying service restoration.
Identifying whether root causes are technical (e.g., configuration drift), process-related (e.g., change approval gaps), or related to human factors.
Classifying contributing factors such as alert desensitization, documentation gaps, or insufficient failover testing.
Deciding which findings require formal action items versus informational updates to runbooks.
Archiving incident records in a searchable knowledge base to support trend analysis and compliance audits.

Module 5: Change Management and Downtime Prevention

Requiring downtime impact assessments for standard, normal, and emergency changes, including those reviewed by the change advisory board (CAB).
Enforcing pre-implementation validation steps such as configuration backups and rollback testing for high-risk changes.
Blocking unauthorized changes during critical business periods using automated change freeze policies.
Integrating deployment pipelines with incident management systems to flag recent changes during outage triage.
Assessing whether peer review requirements for code and configuration changes are sufficient to prevent regression failures.
Managing emergency change approvals while maintaining audit trail completeness and retrospective review.
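
The change freeze and triage topics above can be sketched as two small helpers. This is a minimal example with hypothetical names (`Change`, `is_blocked_by_freeze`, `recent_changes`) and an assumed 24-hour lookback; it does not represent the API of any particular change management or deployment tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Sequence, Tuple

Window = Tuple[datetime, datetime]  # (start inclusive, end exclusive)


@dataclass(frozen=True)
class Change:
    change_id: str
    service: str
    deployed_at: datetime


def is_blocked_by_freeze(
    change_time: datetime,
    freeze_windows: Sequence[Window],
    emergency: bool = False,
) -> bool:
    """Block non-emergency changes that fall inside a freeze window.

    Emergency changes are assumed to follow a separate approval path with
    retrospective review, so they are not blocked here.
    """
    if emergency:
        return False
    return any(start <= change_time < end for start, end in freeze_windows)


def recent_changes(
    changes: Iterable[Change],
    service: str,
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),
) -> List[Change]:
    """List changes to `service` deployed shortly before the incident, newest first."""
    cutoff = incident_start - lookback
    hits = [
        c for c in changes
        if c.service == service and cutoff <= c.deployed_at <= incident_start
    ]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)
```

Surfacing the newest change first reflects the common triage heuristic that the most recent deployment is the first candidate to rule out.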

Module 6: High Availability and Resilience Design

Designing active-passive versus active-active architectures based on RTO and RPO requirements for critical services.
Validating failover mechanisms through scheduled, controlled disruption tests without impacting production users.
Allocating redundancy at the right layer (network, server, data, or application) based on failure domain analysis.
Implementing circuit breakers and graceful degradation features to minimize user-facing downtime during partial failures.
Assessing cost-benefit trade-offs of multi-region deployments versus localized redundancy for non-critical systems.
Updating disaster recovery runbooks to reflect current system dependencies and credential access paths.
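
To make the circuit breaker and graceful degradation topic concrete, here is a minimal single-threaded sketch. The class name, default thresholds, and fallback mechanism are assumptions for illustration; production implementations usually add thread safety, per-error-type handling, and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[[], T], fallback: Optional[Callable[[], T]] = None) -> T:
        if self.state == "open":
            # Fail fast: do not touch the unhealthy dependency at all.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-trip after a failed trial call.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            if fallback is not None:
                return fallback()  # graceful degradation, e.g. cached data
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might wrap a dependency as `breaker.call(fetch_live_prices, fallback=read_cached_prices)`, where both function names are hypothetical, so that users see slightly stale data instead of an error while the dependency recovers.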

Module 7: Downtime Reporting and Business Alignment

Generating uptime reports segmented by business unit, geography, and customer segment to reflect actual service impact.
Reconciling system-generated uptime metrics with business-reported outage experiences to identify detection gaps.
Presenting downtime data to executive stakeholders using business outcome metrics instead of technical availability percentages.
Aligning SLA reporting periods with financial or operational reporting cycles for consistency in performance reviews.
Managing disputes over downtime attribution when multiple systems contribute to a single service disruption.
Updating service catalogs and dependency maps to ensure accurate impact assessment in future reporting cycles.
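
The reporting topics above depend on a consistent availability calculation. The sketch below shows one possible convention, in which planned maintenance is excluded from both the measured period and the downtime total. It assumes that outage intervals do not overlap each other and that maintenance intervals do not overlap each other; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Interval = Tuple[datetime, datetime]  # (start, end)


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of the intersection of two intervals (zero if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def availability(
    period: Interval,
    outages: Iterable[Interval],
    maintenance: Iterable[Interval] = (),
) -> float:
    """Fraction of measured time the service was up, excluding planned maintenance."""
    maintenance = list(maintenance)
    excluded = sum((overlap(period, m) for m in maintenance), timedelta(0))
    measured = (period[1] - period[0]) - excluded
    if measured <= timedelta(0):
        raise ValueError("no measurable time in period")
    down = timedelta(0)
    for outage in outages:
        # Clip the outage to the reporting period.
        clipped = (max(outage[0], period[0]), min(outage[1], period[1]))
        if clipped[1] <= clipped[0]:
            continue
        # Do not count outage time that fell inside planned maintenance.
        in_maint = sum((overlap(clipped, m) for m in maintenance), timedelta(0))
        down += (clipped[1] - clipped[0]) - in_maint
    return 1.0 - down / measured
```

Whether maintenance is excluded is itself an SLA decision; under a convention that counts all wall-clock time, the `maintenance` argument would simply be left empty.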

Module 8: Continuous Improvement and Organizational Learning

Prioritizing remediation efforts based on recurrence frequency and business impact of past downtime events.
Integrating incident metrics into team performance dashboards without creating perverse incentives around incident suppression.
Conducting trend analysis to identify systemic issues such as recurring configuration errors or tooling gaps.
Updating training materials for support teams based on common misdiagnoses observed in past incidents.
Rotating incident management roles during drills to build cross-functional response capability.
Assessing maturity of incident processes using frameworks like ITIL or SRE without over-bureaucratizing operations.
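
As a simple illustration of prioritizing remediation by recurrence and impact, the sketch below aggregates past incidents by cause. Ranking by total impact minutes, with recurrence count as a tie-breaker, is one assumed scoring choice among many; the input format is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_causes(incidents: Iterable[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Rank causes by total impact minutes, then by how often they recurred.

    incidents: (cause, impact_minutes) pairs taken from post-incident reviews.
    Returns (cause, occurrence_count, total_impact_minutes), highest priority first.
    """
    totals: Dict[str, List[float]] = defaultdict(lambda: [0, 0.0])
    for cause, minutes in incidents:
        totals[cause][0] += 1
        totals[cause][1] += minutes
    rows = [(cause, int(count), total) for cause, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: (row[2], row[1]), reverse=True)
```

Impact minutes could be replaced by a business measure such as affected orders or revenue at risk where that data is available.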