This curriculum follows the structure of a multi-workshop infrastructure governance program. It addresses the technical and procedural challenges typical of enterprise advisory engagements on ITSM integration, service ownership, and cross-functional alignment across operations, security, and vendor management.
Module 1: Defining Infrastructure Scope and Service Boundaries
- Determine which on-premises systems fall under ITSM ownership versus those managed by specialized engineering teams (e.g., network backbone vs. application middleware).
- Establish service boundary agreements with cloud providers to clarify responsibilities for monitoring, patching, and incident response for IaaS components.
- Map infrastructure components to business services to prioritize monitoring and maintenance efforts based on business impact.
- Decide whether virtual machines in development environments require full CMDB tracking or can be excluded based on risk tolerance.
- Integrate edge computing devices (e.g., IoT gateways) into the infrastructure management framework, including asset tagging and lifecycle tracking.
- Negotiate ownership of shared infrastructure (e.g., load balancers, DNS servers) between infrastructure and security teams to prevent operational gaps.
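The mapping of infrastructure components to business services can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical three-level impact scale and made-up component and service names; in practice the mapping would be exported from the CMDB.

```python
from dataclasses import dataclass

# Hypothetical impact scale; a higher number means greater business impact.
IMPACT_RANK = {"low": 1, "medium": 2, "high": 3}

@dataclass(frozen=True)
class BusinessService:
    name: str
    impact: str  # one of the IMPACT_RANK keys

def prioritize(component_map):
    """Rank components by the highest impact among the services they support."""
    ranked = []
    for component, services in component_map.items():
        # Components with no mapped service get rank 0 and sort last.
        top = max((IMPACT_RANK[s.impact] for s in services), default=0)
        ranked.append((component, top))
    # Highest impact first; ties are broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

payments = BusinessService("payments", "high")
intranet = BusinessService("intranet", "low")
component_map = {
    "lb-01": [payments, intranet],   # shared load balancer
    "dns-01": [intranet],
    "iot-gw-07": [],                 # edge device not yet mapped to a service
}
priorities = prioritize(component_map)
```

Components that land at rank 0 are worth a second look: they are either safe to deprioritize or simply missing from the service map.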
Module 2: Configuration Management Database (CMDB) Governance
- Define CI (Configuration Item) attributes for network switches, including location, firmware version, and support contract expiration.
- Implement automated discovery tooling while configuring exclusion rules to prevent shadow IT systems from polluting the CMDB.
- Enforce CI ownership rules requiring system administrators to validate and approve CMDB entries during change implementation.
- Resolve conflicting data from multiple discovery sources by establishing a reconciliation process with defined precedence rules.
- Design audit workflows to verify CMDB accuracy quarterly, focusing on high-impact services and recently decommissioned assets.
- Integrate CMDB with vulnerability management systems to trigger alerts when unpatched CIs are detected in production.
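The reconciliation process with precedence rules might look like the sketch below. The source names and their ordering are assumptions chosen for illustration, not a recommendation; each organization defines its own trust order per attribute.

```python
# Hypothetical discovery sources, listed from most to least trusted.
SOURCE_PRECEDENCE = ["agent", "network_scan", "manual_entry"]

def reconcile(records):
    """Merge per-source attribute dicts, letting higher-precedence sources win."""
    merged = {}
    # Apply the least trusted source first so more trusted ones overwrite it.
    for source in reversed(SOURCE_PRECEDENCE):
        for attribute, value in records.get(source, {}).items():
            if value:  # skip empty values so they never erase real data
                merged[attribute] = value
    return merged

switch_records = {
    "manual_entry": {"location": "DC1-R12", "firmware": "15.2(4)"},
    "network_scan": {"firmware": "15.2(7)", "location": ""},
    "agent": {"firmware": "15.2(7)E3"},
}
ci = reconcile(switch_records)
```

Here the firmware version comes from the agent, while the location survives from the manual entry because no more trusted source supplied a non-empty value.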
Module 3: Change Enablement and Risk Assessment
- Classify infrastructure changes (standard, normal, emergency) based on impact, urgency, and technical complexity using documented criteria.
- Require rollback plans for all non-standard changes to core network infrastructure, including pre-tested configuration backups.
- Coordinate change windows with application teams to minimize disruption during batch processing or peak usage periods.
- Implement peer review requirements for firewall rule changes, requiring at least one network engineer to validate proposed access controls.
- Use change risk scoring models that factor in CI criticality, change history, and implementation team experience.
- Log all emergency changes in the change system within 24 hours and schedule post-implementation reviews to assess whether the emergency classification was justified.
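A change risk scoring model that combines CI criticality, change history, and team experience can be sketched as a weighted sum. The weights, the 1 to 5 criticality scale, the 10-change experience cap, and the band cut-offs are all assumptions made for this example and would need calibration against real change outcomes.

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    ci_criticality: int       # 1 (low) to 5 (business critical)
    past_failure_rate: float  # fraction of similar past changes that failed
    team_experience: int      # similar changes the team has completed

# Hypothetical weights; they sum to 1.0 so the score stays between 0 and 100.
WEIGHTS = {"criticality": 0.5, "history": 0.3, "experience": 0.2}

def risk_score(change):
    criticality = (change.ci_criticality - 1) / 4  # normalize to 0..1
    history = min(max(change.past_failure_rate, 0.0), 1.0)
    # Inexperience falls from 1.0 to 0.0 as the team reaches 10 similar changes.
    inexperience = 1.0 - min(change.team_experience, 10) / 10
    score = (WEIGHTS["criticality"] * criticality
             + WEIGHTS["history"] * history
             + WEIGHTS["experience"] * inexperience)
    return round(100 * score, 1)

def risk_band(score):
    # Hypothetical cut-offs for routing a change to the right approval path.
    if score >= 60:
        return "high"
    if score >= 30:
        return "medium"
    return "low"

core_router_change = ChangeRequest(ci_criticality=5, past_failure_rate=0.2, team_experience=2)
routine_change = ChangeRequest(ci_criticality=1, past_failure_rate=0.0, team_experience=10)
```

A linear model like this is easy to explain to a change advisory board, which tends to matter more than statistical sophistication when the score gates approvals.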
Module 4: Incident Management for Infrastructure Outages
- Define escalation paths for infrastructure incidents based on service degradation thresholds (e.g., latency >500ms for 5 minutes).
- Integrate monitoring alerts with incident management tools using correlation rules to prevent alert storms during cascading failures.
- Assign incident commanders for major outages, ensuring one individual has authority to coordinate cross-team response efforts.
- Maintain a known error database (KEDB) for recurring infrastructure issues, such as NIC driver failures on specific server models.
- Conduct blameless post-mortems for all P1 incidents, focusing on process gaps rather than individual blame.
- Validate incident resolution by confirming performance metrics return to baseline, not just service availability.
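The degradation threshold used as an example above (latency above 500 ms for 5 minutes) implies a sustained-breach check and not a single-sample check. The sketch below assumes samples arrive as hypothetical `(timestamp_seconds, latency_ms)` pairs sorted by time, and it treats gaps between samples as a continuation of the previous state.

```python
def sustained_breach(samples, limit_ms=500.0, window_s=300):
    """Return True if latency stays above limit_ms for at least window_s seconds."""
    breach_start = None
    for timestamp, latency in samples:
        if latency > limit_ms:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= window_s:
                return True
        else:
            breach_start = None  # any recovery resets the clock
    return False

# One sample per minute for five minutes, all above the limit.
sustained = [(t, 620.0) for t in range(0, 301, 60)]
# Same series, but latency recovers briefly at the three-minute mark.
spiky = [(t, 120.0 if t == 180 else 620.0) for t in range(0, 301, 60)]
```

With these inputs, `sustained_breach(sustained)` is true and `sustained_breach(spiky)` is false, so only the first series would trigger escalation.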
Module 5: Proactive Monitoring and Performance Tuning
- Set threshold-based alerts for storage array latency, distinguishing between transient spikes and sustained degradation.
- Deploy synthetic transactions to monitor end-to-end performance of critical infrastructure paths (e.g., AD authentication response).
- Balance monitoring coverage with system overhead by limiting agent-based collection on high-throughput database servers.
- Use baselining techniques to detect anomalous behavior in virtualized environments, such as unexpected growth in VM count (sprawl) or memory ballooning.
- Configure dependency mapping in monitoring tools to suppress downstream alerts when a root cause is identified.
- Rotate monitoring responsibilities across shifts to ensure 24/7 operational knowledge and reduce reliance on tribal knowledge.
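Dependency-based alert suppression can be illustrated with a small graph walk. The dependency map and component names are hypothetical; monitoring tools usually build this map from discovery data or the CMDB.

```python
# Hypothetical dependency map: each component lists what it directly depends on.
DEPENDS_ON = {
    "web-01": ["lb-01"],
    "lb-01": ["core-sw-01"],
    "db-01": ["san-01"],
    "core-sw-01": [],
    "san-01": [],
}

def upstream(component, depends_on):
    """Return every direct and transitive dependency of a component."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen

def actionable_alerts(alerting, depends_on):
    """Keep only alerts for components whose upstream dependencies are not alerting."""
    alerting = set(alerting)
    return sorted(c for c in alerting if not (upstream(c, depends_on) & alerting))

root_causes = actionable_alerts(["web-01", "lb-01", "core-sw-01", "db-01"], DEPENDS_ON)
```

The alerts for `web-01` and `lb-01` are suppressed because `core-sw-01`, which both depend on, is also alerting. The `db-01` alert is kept because its storage dependency is healthy.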
Module 6: Lifecycle Management and Decommissioning
- Initiate hardware refresh cycles based on vendor support timelines and failure rate trends from historical incident data.
- Verify data sanitization on decommissioned storage devices against NIST SP 800-88 (Guidelines for Media Sanitization) before physical disposal.
- Update service documentation and runbooks when retiring legacy systems to reflect current supported configurations.
- Coordinate with procurement to align end-of-support dates with budget cycles for replacement planning.
- Archive configuration backups and logs for decommissioned systems according to regulatory retention policies.
- Conduct stakeholder reviews before decommissioning to confirm no downstream dependencies remain active.
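Selecting hardware refresh candidates from vendor support timelines and incident history can be sketched as a simple filter. The one-year lead time, the incident limit, and the asset records are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Asset:
    name: str
    support_ends: date
    incidents_last_year: int

def refresh_candidates(assets, today, lead_time_days=365, incident_limit=4):
    """Flag assets nearing end of vendor support or failing at or above the limit."""
    cutoff = today + timedelta(days=lead_time_days)
    flagged = []
    for asset in assets:
        reasons = []
        if asset.support_ends <= cutoff:
            reasons.append("support ending")
        if asset.incidents_last_year >= incident_limit:
            reasons.append("high failure rate")
        if reasons:
            flagged.append((asset.name, reasons))
    return flagged

inventory = [
    Asset("esx-01", date(2025, 6, 30), 1),
    Asset("san-02", date(2028, 1, 1), 6),
    Asset("sw-03", date(2029, 1, 1), 0),
]
candidates = refresh_candidates(inventory, today=date(2025, 1, 1))
```

Keeping the reasons alongside each asset makes the output usable as input to the budget conversation with procurement, since the two triggers usually justify different urgency.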
Module 7: Integration with Security and Compliance Frameworks
- Enforce infrastructure hardening standards by integrating configuration compliance checks into the change approval workflow.
- Share CMDB data with vulnerability scanners to prioritize patching based on asset criticality and exposure.
- Implement just-in-time access for administrative privileges on production infrastructure using PAM solutions.
- Generate audit trails for all privileged infrastructure actions, ensuring logs are immutable and centrally stored.
- Align infrastructure change blackout periods with PCI DSS audit windows to reduce configuration drift risk.
- Map infrastructure controls to compliance frameworks (e.g., ISO 27001, SOC 2) to streamline evidence collection.
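One way to make an audit trail tamper-evident is to chain entries with hashes, so that editing any earlier entry invalidates everything after it. Note that this is a swapped-in technique for illustration: it detects modification but does not prevent it, and real immutability would still depend on write-once storage or a central logging platform. The entry fields used here are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(actor, action, prev_hash):
    body = {"actor": actor, "action": action, "prev": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_entry(log, actor, action):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"actor": actor, "action": action, "prev": prev_hash,
                "hash": _digest(actor, action, prev_hash)})

def verify(log):
    """Recompute every hash and confirm each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        expected = _digest(entry["actor"], entry["action"], entry["prev"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_entry(audit_log, "admin-a", "sudo systemctl restart nginx on web-01")
append_entry(audit_log, "admin-b", "firewall rule 212 modified on fw-02")
```

Calling `verify(audit_log)` returns true for the untouched log and false as soon as any stored field is altered.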
Module 8: Vendor and Contract Management for Infrastructure Services
- Negotiate SLAs with hardware vendors that include penalties for missed on-site response times for critical failures.
- Track warranty expirations in the asset management system to trigger procurement of extended support contracts.
- Validate vendor-provided runbooks against internal operational procedures before accepting managed services.
- Require third-party providers to integrate their monitoring systems with the enterprise event management platform.
- Conduct quarterly business reviews with infrastructure vendors to assess performance against KPIs and resolve recurring issues.
- Enforce right-to-audit clauses in contracts to verify compliance with security and operational commitments.
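Measuring vendor performance against an on-site response SLA can be sketched as follows, for example as input to a quarterly business review. The four-hour target, the ticket identifiers, and the tuple layout are assumptions for this example; actual targets come from the contract.

```python
from datetime import datetime, timedelta

def sla_report(tickets, target=timedelta(hours=4)):
    """Summarize on-site response compliance.

    tickets: list of (ticket_id, opened, engineer_on_site) tuples.
    """
    missed = [tid for tid, opened, arrived in tickets if arrived - opened > target]
    met = len(tickets) - len(missed)
    # With no tickets there is nothing to miss, so report full compliance.
    compliance = 100.0 * met / len(tickets) if tickets else 100.0
    return {"compliance_pct": round(compliance, 1), "missed": missed}

day = datetime(2025, 3, 10)
quarter_tickets = [
    ("INC-1", day.replace(hour=8), day.replace(hour=10)),
    ("INC-2", day.replace(hour=9), day.replace(hour=13)),   # exactly 4 hours: met
    ("INC-3", day.replace(hour=9), day.replace(hour=15)),   # 6 hours: missed
    ("INC-4", day.replace(hour=12), day.replace(hour=13)),
]
report = sla_report(quarter_tickets)
```

The list of missed ticket IDs is what would be matched against the penalty clause, while the percentage is the KPI tracked from quarter to quarter.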