This curriculum covers the full lifecycle of power-related incidents. Its scope is equivalent to a multi-workshop program that integrates facility operations, IT resilience, and cross-functional coordination, work that large organisations typically manage through joint incident response and infrastructure advisory efforts.
Module 1: Defining Critical Systems and Failure Thresholds
- Establishing RTOs and RPOs for power-dependent systems based on business impact analysis across departments
- Classifying applications into tiers (e.g., Tier 0 for life-safety systems, Tier 1 for revenue-generating platforms) during outage planning
- Documenting dependencies between IT systems and facility infrastructure (HVAC, elevators, access control) in outage scenarios
- Deciding which systems receive UPS or generator support when capacity is constrained
- Integrating physical security systems into incident response plans when power loss disables biometric access
- Mapping data replication paths to ensure failover systems remain accessible during extended outages
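The tiering and capacity decisions in this module can be illustrated with a small allocation sketch. It is not a prescribed method: the `Load` class, the load names, the wattage figures, and the 5000 W capacity are all hypothetical, and the greedy rule (most critical tier first, smaller loads first within a tier) is only one possible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    name: str
    tier: int   # 0 = life-safety, 1 = revenue-generating, higher = less critical
    watts: int  # estimated steady-state draw (hypothetical figures below)


def allocate_backup_power(loads, capacity_watts):
    """Greedy allocation: lowest tier number first, smaller loads first within a tier.

    Returns (supported_names, shed_names, remaining_watts).
    """
    supported, shed = [], []
    remaining = capacity_watts
    for load in sorted(loads, key=lambda l: (l.tier, l.watts)):
        if load.watts <= remaining:
            supported.append(load.name)
            remaining -= load.watts
        else:
            shed.append(load.name)
    return supported, shed, remaining


# Hypothetical inventory for illustration only.
loads = [
    Load("fire-alarm-panel", 0, 400),
    Load("badge-readers", 0, 300),
    Load("payments-db", 1, 2500),
    Load("web-frontend", 1, 1800),
    Load("build-farm", 2, 4000),
]
supported, shed, headroom = allocate_backup_power(loads, 5000)
print("supported:", supported)
print("shed:", shed)
print("headroom (W):", headroom)
```

Note that this greedy rule keeps scanning after a load fails to fit, so a small lower-tier load can be supported while a larger higher-tier load is shed. A real plan would likely need an explicit policy decision on whether that is acceptable.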
Module 2: Incident Detection and Alerting Protocols
- Configuring SNMP traps and environmental sensors to trigger alerts on power anomalies before full failure
- Setting escalation thresholds for power alerts to prevent alert fatigue during brownout conditions
- Integrating building management systems (BMS) with IT monitoring tools for correlated event detection
- Validating alert delivery paths (SMS, email, voice) when primary network infrastructure is compromised
- Implementing heartbeat monitoring for backup generators and UPS systems to detect silent failures
- Defining thresholds that suppress false-positive automatic incident initiation during transient power fluctuations
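One way to think about escalation and false-positive thresholds is a simple debounce: an incident opens only when voltage stays out of range for several consecutive samples. The sketch below assumes a hypothetical 230 V nominal supply, a ±10% tolerance band, and a three-sample minimum; none of these values come from the curriculum, and real thresholds would depend on the site and the sampling interval.

```python
def should_open_incident(voltage_samples, nominal=230.0, tolerance=0.10,
                         min_consecutive=3):
    """Return True if voltage is out of band for min_consecutive samples in a row.

    All default values are illustrative assumptions, not recommended settings.
    """
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    run = 0  # length of the current out-of-band streak
    for volts in voltage_samples:
        if volts < low or volts > high:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a single in-band sample resets the streak
    return False


# A one-sample dip is treated as transient; a sustained sag opens an incident.
print(should_open_incident([230, 195, 231, 229]))
print(should_open_incident([230, 200, 198, 196, 229]))
```

Raising `min_consecutive` reduces false positives at the cost of slower detection, which is the trade-off the bullets on alert fatigue and transient fluctuations describe.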
Module 3: Communication and Stakeholder Coordination
- Activating pre-approved communication templates for executive, employee, and customer audiences during escalating outages
- Assigning communication ownership to specific roles when normal collaboration tools (email, Teams) are unavailable
- Using LTE hotspots or satellite phones to maintain external communications when site networks or cellular service degrade
- Coordinating with utility providers to obtain estimated restoration times and validate outage scope
- Logging all stakeholder interactions in the incident management system for post-mortem analysis
- Managing legal and regulatory disclosure obligations when outages impact SLAs or data availability
Module 4: Operational Failover and System Recovery
- Executing failover runbooks for database clusters while ensuring transaction consistency across sites
- Validating generator auto-start sequences and fuel levels during transition from utility power
- Initiating cold-site activation procedures when both primary and secondary data centers lose power
- Managing DNS TTL settings in advance to enable rapid redirection to backup environments
- Assessing data integrity after abrupt shutdowns using filesystem journaling and checksum verification
- Delaying non-critical service restarts to prioritize power allocation during generator runtime
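The checksum verification step mentioned above could look roughly like the following. This is a minimal sketch that assumes a manifest of known-good SHA-256 digests was recorded before the outage; the manifest format (a dict of path to hex digest) and the function names are hypothetical, and the check complements rather than replaces filesystem journal recovery.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Return the paths that are missing or whose digest differs from the manifest.

    manifest: dict mapping file path (str) to expected SHA-256 hex digest.
    """
    failures = []
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file() or sha256_of(p) != expected:
            failures.append(str(path))
    return failures


# Demonstration with a temporary file standing in for application data.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "orders.dat"
    data_file.write_bytes(b"committed records")
    manifest = {str(data_file): sha256_of(data_file)}  # recorded before the outage

    data_file.write_bytes(b"committed rec")  # simulate a truncated write
    print("files needing attention:", verify_manifest(manifest))
```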
Module 5: On-Site Response and Facility Management
- Dispatching facility engineers to inspect automatic transfer switches (ATS) and their event logs during power transfer events
- Deploying portable lighting and temporary power to maintain safety in server rooms and control centers
- Enforcing physical access logs when electronic badge systems fail due to power loss
- Coordinating with fire marshals when emergency lighting or egress systems are affected by an outage
- Monitoring server inlet temperatures during cooling system failure to prevent thermal shutdowns
- Documenting equipment damage from power surges or improper shutdowns for insurance claims
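For the inlet temperature monitoring item, a rough estimate of the time remaining before a thermal limit is reached can help prioritize response. The sketch below uses a crude linear extrapolation between the first and last readings; the 35 °C limit and the sample readings are hypothetical, and the real limit should come from the hardware vendor's specifications.

```python
def minutes_until_limit(readings, limit_c):
    """Estimate minutes until inlet temperature reaches limit_c.

    readings: chronological list of (minute, temp_c) tuples, at least two.
    Returns 0.0 if the limit is already reached, None if temperature is not
    rising, otherwise a linear extrapolation from the first and last reading.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings")
    (t0, c0), (t1, c1) = readings[0], readings[-1]
    if t1 <= t0:
        raise ValueError("readings must be in increasing time order")
    if c1 >= limit_c:
        return 0.0
    rate = (c1 - c0) / (t1 - t0)  # degrees C per minute
    if rate <= 0:
        return None
    return (limit_c - c1) / rate


# Hypothetical readings taken every five minutes after cooling failed.
samples = [(0, 24.0), (5, 26.5), (10, 29.0)]
print(minutes_until_limit(samples, limit_c=35.0))
```

Real rooms rarely heat linearly, so an estimate like this is best treated as an early warning rather than a deadline.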
Module 6: Post-Outage Restoration and Validation
- Verifying stable utility power before initiating transfer back from generator to grid
- Staggering system restarts to avoid inrush current overloads on restored circuits
- Validating transaction reconciliation between primary and backup systems after failback
- Conducting filesystem and database consistency checks before resuming production operations
- Updating asset inventories to reflect hardware replaced due to power-related damage
- Re-synchronizing time across systems using NTP after clock drift during the outage
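Staggered restarts can be planned by grouping equipment into waves whose combined inrush current stays under a circuit limit. The following is a sketch under stated assumptions: the device names, inrush figures, and the 40 A limit are hypothetical, and the function simply fills each wave in the given priority order without trying to optimize the number of waves.

```python
def plan_restart_waves(devices, max_inrush_amps):
    """Group devices into sequential restart waves.

    devices: list of (name, inrush_amps) in restart-priority order.
    Each wave's summed inrush stays at or below max_inrush_amps.
    """
    waves, current, load = [], [], 0.0
    for name, inrush in devices:
        if inrush > max_inrush_amps:
            raise ValueError(f"{name} alone exceeds the circuit limit")
        if current and load + inrush > max_inrush_amps:
            waves.append(current)  # close the wave before it would overload
            current, load = [], 0.0
        current.append(name)
        load += inrush
    if current:
        waves.append(current)
    return waves


# Hypothetical equipment list and per-circuit inrush limit.
devices = [
    ("core-switch", 8),
    ("storage-array", 30),
    ("db-01", 18),
    ("db-02", 18),
    ("app-01", 12),
]
waves = plan_restart_waves(devices, max_inrush_amps=40)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In practice each wave would be followed by a settling delay and a check that the circuit has stabilized before the next wave starts.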
Module 7: Incident Review and Resilience Improvement
- Conducting blameless post-mortems to identify single points of failure in power architecture
- Updating runbooks based on observed gaps in response timing or role execution
- Revising generator maintenance schedules after performance issues during actual events
- Re-evaluating UPS battery replacement cycles based on runtime during recent outages
- Adjusting monitoring thresholds to reflect actual power behavior observed during incidents
- Proposing capital upgrades (e.g., dual utility feeds, additional fuel storage) based on outage frequency and impact