Description

This curriculum covers the full lifecycle of power-related incidents, from defining critical systems through post-incident review. Its scope is comparable to a multi-workshop program and integrates facility operations, IT resilience, and cross-functional coordination, work that large organizations typically handle through joint incident response and infrastructure advisory efforts.

Module 1: Defining Critical Systems and Failure Thresholds

Establishing RTOs and RPOs for power-dependent systems based on business impact analysis across departments
Classifying applications into tiers (e.g., Tier 0 for life-safety systems, Tier 1 for revenue-generating platforms) during outage planning
Documenting dependencies between IT systems and facility infrastructure (HVAC, elevators, access control) in outage scenarios
Deciding which systems receive UPS or generator support when capacity is constrained
Integrating physical security systems into incident response plans when power loss disables biometric access
Mapping data replication paths to ensure failover systems remain accessible during extended outages
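
The capacity-constrained UPS and generator decision in this module can be illustrated with a small allocation sketch. It is a minimal example, not a prescribed method: the `System` record, the system names, the load figures, and the greedy tier-first rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    tier: int       # 0 = life-safety, 1 = revenue-generating, higher = less critical
    load_kw: float  # assumed steady-state draw in kW


def allocate_backup_power(systems, capacity_kw):
    """Return (supported, shed) lists, filling backup capacity in tier order."""
    supported, shed = [], []
    remaining = capacity_kw
    # Lower tier number first; within a tier, smaller loads first so more systems fit.
    for s in sorted(systems, key=lambda s: (s.tier, s.load_kw)):
        if s.load_kw <= remaining:
            supported.append(s)
            remaining -= s.load_kw
        else:
            shed.append(s)
    return supported, shed


# Hypothetical inventory; names and loads are illustrative only.
systems = [
    System("fire-alarm-panel", 0, 2.0),
    System("payments-db", 1, 12.0),
    System("web-frontend", 1, 8.0),
    System("build-farm", 3, 20.0),
    System("badge-readers", 0, 1.5),
]

supported, shed = allocate_backup_power(systems, capacity_kw=25.0)
print("supported:", [s.name for s in supported])
print("shed:", [s.name for s in shed])
```

With 25 kW available, this sketch keeps both Tier 0 systems and both Tier 1 systems and sheds only the build farm. Note that the greedy rule keeps scanning after a system is shed, so a small low-tier load can still be supported when a larger high-tier load does not fit; whether that is acceptable is a policy decision for the business impact analysis.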

Module 2: Incident Detection and Alerting Protocols

Configuring SNMP traps and environmental sensors to trigger alerts on power anomalies before full failure
Setting escalation thresholds for power alerts to prevent alert fatigue during brownout conditions
Integrating building management systems (BMS) with IT monitoring tools for correlated event detection
Validating alert delivery paths (SMS, email, voice) when primary network infrastructure is compromised
Implementing heartbeat monitoring for backup generators and UPS systems to detect silent failures
Defining false-positive thresholds for automatic incident initiation during transient power fluctuations
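
One way to keep transient fluctuations from opening incidents automatically is to require several consecutive out-of-range samples before triggering. The sketch below assumes 1-second voltage readings on a nominal 230 V feed and a tolerance band of 207 V to 253 V (plus or minus 10%); the function name, the band, and the three-sample rule are illustrative assumptions, not recommended settings.

```python
def sustained_anomaly(samples, low, high, min_consecutive):
    """Return the index of the sample at which an incident would open, or None.

    An incident opens only after `min_consecutive` out-of-range samples in a row,
    so a single dip or spike is ignored.
    """
    run = 0
    for i, value in enumerate(samples):
        if value < low or value > high:
            run += 1
            if run >= min_consecutive:
                return i
        else:
            run = 0  # back in range: the transient has cleared
    return None


# Hypothetical readings: one isolated dip, then a sustained sag.
readings = [230, 229, 198, 231, 230, 205, 203, 201, 199, 228]
print(sustained_anomaly(readings, low=207, high=253, min_consecutive=3))  # prints 7
```

The isolated dip at index 2 is ignored, and the incident opens at index 7, the third consecutive low reading. Raising `min_consecutive` reduces false positives at the cost of slower detection.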

Module 3: Communication and Stakeholder Coordination

Activating pre-approved communication templates for executive, employee, and customer audiences during escalating outages
Assigning communication ownership to specific roles when normal collaboration tools (email, Teams) are unavailable
Using satellite phones or LTE hotspots to maintain external communications when primary voice and data links degrade
Coordinating with utility providers to obtain estimated restoration times and validate outage scope
Logging all stakeholder interactions in the incident management system for post-mortem analysis
Managing legal and regulatory disclosure obligations when outages impact SLAs or data availability

Module 4: Operational Failover and System Recovery

Executing failover runbooks for database clusters while ensuring transaction consistency across sites
Validating generator auto-start sequences and fuel levels during transition from utility power
Initiating cold-site activation procedures when primary and secondary data centers lose power
Managing DNS TTL settings in advance to enable rapid redirection to backup environments
Assessing data integrity after abrupt shutdowns using filesystem journaling and checksum verification
Delaying non-critical service restarts to prioritize power allocation during generator runtime
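
The checksum verification step after an abrupt shutdown can be sketched as a comparison of current file digests against a manifest recorded before the outage. This is a minimal example under stated assumptions: the manifest format (a mapping of path to expected SHA-256 hex digest) and the function names are hypothetical, and it complements rather than replaces filesystem journal replay and database-level checks.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Return the paths that are missing or whose digest differs from the manifest.

    `manifest` maps file path -> expected SHA-256 hex digest (assumed format).
    """
    failures = []
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file() or sha256_of(p) != expected:
            failures.append(path)
    return failures
```

An empty result means every listed file is present and matches its recorded digest. Any returned path is a candidate for restoration from a replica or backup before services restart.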

Module 5: On-Site Response and Facility Management

Dispatching facility engineers to inspect automatic transfer switches (ATS) and their event logs during power transfer events
Deploying portable lighting and temporary power to maintain safety in server rooms and control centers
Enforcing physical access logs when electronic badge systems fail due to power loss
Coordinating with fire marshals when emergency lighting or egress systems are affected by an outage
Monitoring server inlet temperatures during cooling system failure to prevent thermal shutdowns
Documenting equipment damage from power surges or improper shutdowns for insurance claims
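
When cooling fails, a useful figure for on-site responders is how long they have before server inlet temperatures reach the shutdown threshold. The sketch below estimates this by linear extrapolation from the first and last readings. The 32 °C threshold, the sample readings, and the function name are assumptions for illustration; real temperature rise is rarely linear, so treat the result as a rough planning figure.

```python
def minutes_to_threshold(readings, threshold_c):
    """Estimate minutes from the latest reading until inlet temperature hits threshold_c.

    `readings` is a list of (minute, inlet_temp_c) tuples, oldest first.
    Returns 0.0 if already at or above the threshold, and None if the
    temperature is not rising (no meaningful estimate).
    """
    (t0, c0), (t1, c1) = readings[0], readings[-1]
    if c1 >= threshold_c:
        return 0.0
    if t1 <= t0:
        return None  # need at least two readings at different times
    rate = (c1 - c0) / (t1 - t0)  # degrees C per minute
    if rate <= 0:
        return None
    return (threshold_c - c1) / rate


# Hypothetical readings taken every two minutes after a chiller trip.
readings = [(0, 24.0), (2, 25.0), (4, 26.0)]
print(minutes_to_threshold(readings, threshold_c=32.0))  # prints 12.0
```

Here the inlet temperature rises 0.5 °C per minute, leaving about 12 minutes before the assumed threshold is reached.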

Module 6: Post-Outage Restoration and Validation

Verifying stable utility power before initiating transfer back from generator to grid
Staggering system restarts to avoid inrush current overloads on restored circuits
Validating transaction reconciliation between primary and backup systems after failback
Conducting filesystem and database consistency checks before resuming production operations
Updating asset inventories to reflect hardware replaced due to power-related damage
Re-synchronizing time across systems using NTP to correct clock drift accumulated during the outage
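
The staggered restart in this module can be sketched as a simple schedule that starts one device at a time and admits a device only if the load already running plus that device's inrush current fits under the circuit limit. The 3x inrush factor, the 30-second spacing, the device names, and the current figures are all illustrative assumptions; actual values should come from equipment data sheets and the facility's electrical design.

```python
def stagger_restarts(loads, circuit_limit_a, inrush_factor=3.0, delay_s=30):
    """Build a restart schedule that avoids overlapping inrush peaks.

    `loads` is a list of (name, steady_state_amps) in restart priority order.
    Returns (schedule, deferred): schedule is a list of (start_time_s, name),
    deferred lists devices that would exceed the limit on this circuit.
    """
    schedule, deferred = [], []
    running_a = 0.0
    t = 0
    for name, steady_a in loads:
        # During start-up the device is assumed to draw inrush_factor x steady state.
        if running_a + steady_a * inrush_factor <= circuit_limit_a:
            schedule.append((t, name))
            running_a += steady_a  # settles to steady state before the next start
            t += delay_s
        else:
            deferred.append(name)
    return schedule, deferred


# Hypothetical devices on one 24 A circuit.
loads = [
    ("core-switch", 2.0),
    ("storage-array", 6.0),
    ("db-server", 4.0),
    ("app-server", 4.0),
    ("batch-node", 5.0),
]
schedule, deferred = stagger_restarts(loads, circuit_limit_a=24.0)
print(schedule)
print(deferred)
```

In this example the first four devices start 30 seconds apart, and the batch node is deferred because its assumed inrush would push the circuit past 24 A.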

Module 7: Incident Review and Resilience Improvement

Conducting blameless post-mortems to identify single points of failure in power architecture
Updating runbooks based on observed gaps in response timing or role execution
Revising generator maintenance schedules after performance issues during actual events
Re-evaluating UPS battery replacement cycles based on runtime during recent outages
Adjusting monitoring thresholds to reflect actual power behavior observed during incidents
Proposing capital upgrades (e.g., dual utility feeds, additional fuel storage) based on outage frequency and impact
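
Re-evaluating UPS battery replacement cycles can start from a simple comparison of the runtime observed during an outage against the rated runtime at a comparable load. The sketch below is a rough screening aid only: the 80% cutoff and the sample figures are assumed placeholders, and because runtime varies nonlinearly with load, the comparison is meaningful only when the two loads are similar.

```python
def battery_health(observed_runtime_min, rated_runtime_min, replace_below=0.80):
    """Return (ratio, needs_replacement) for a UPS battery string.

    ratio is observed runtime as a fraction of rated runtime at a similar load;
    needs_replacement is True when ratio falls below the assumed cutoff.
    """
    ratio = observed_runtime_min / rated_runtime_min
    return ratio, ratio < replace_below


# Hypothetical figures: 11 minutes observed against 15 minutes rated.
ratio, replace = battery_health(11.0, 15.0)
print(f"{ratio:.0%} of rated runtime, replace={replace}")
```

A result below the cutoff is a prompt to schedule a formal battery test, not a substitute for one.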