This curriculum spans a multi-workshop capacity management program, covering the technical, operational, and governance practices of mature enterprise environments that run hybrid infrastructure under formal IT service management frameworks.
Module 1: Foundational Principles of Capacity Management
- Selecting between reactive and proactive capacity planning based on historical incident patterns and business tolerance for service degradation.
- Defining service capacity units (e.g., transactions per second, concurrent users) that align with business-critical workloads and technical monitoring capabilities.
- Establishing thresholds for performance degradation that trigger capacity reviews, balancing sensitivity with operational noise.
- Integrating capacity planning into ITIL service lifecycle phases, particularly service design and continual service improvement.
- Mapping application dependencies to infrastructure tiers to identify capacity bottlenecks beyond isolated component metrics.
- Documenting assumptions about growth rates and workload behavior used in long-term capacity forecasts.
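The degradation-threshold idea above can be sketched as a small check: trigger a capacity review only after sustained breaches of a capacity-unit threshold, so transient spikes don't generate operational noise. The capacity unit (TPS), the 80% threshold, and the three-interval streak are illustrative assumptions, not prescriptions.

```python
def review_triggered(samples, rated_tps, threshold=0.8, consecutive=3):
    """Return True once `consecutive` intervals exceed `threshold` of rated
    capacity. Requiring a streak balances sensitivity against alert noise."""
    streak = 0
    for tps in samples:
        if tps / rated_tps >= threshold:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0  # a single quiet interval resets the streak
    return False
```

In practice the threshold and streak length would themselves come from the documented degradation criteria agreed with the business.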
Module 2: Demand Forecasting and Workload Modeling
- Choosing between time-series forecasting models (e.g., ARIMA, exponential smoothing) based on data availability and seasonality patterns.
- Adjusting baseline forecasts for one-time business events such as product launches or marketing campaigns using historical analog data.
- Segmenting user populations by behavior (e.g., peak usage times, transaction volume) to model differentiated demand profiles.
- Validating forecast accuracy quarterly by comparing predicted vs. actual utilization and recalibrating models accordingly.
- Modeling workload elasticity for cloud-native applications, including auto-scaling lag and cold-start impacts.
- Documenting confidence intervals around projections to inform risk-based infrastructure investment decisions.
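As a minimal sketch of the forecasting and confidence-interval points above, simple exponential smoothing can be implemented in a few lines, with a rough prediction interval derived from one-step-ahead residuals. The smoothing constant and the 1.96 z-value are illustrative assumptions; a production model would be selected and recalibrated against actuals as described above.

```python
import math

def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*level.
    Returns (one-step forecast, residual std) for interval construction."""
    level = series[0]
    residuals = []
    for y in series[1:]:
        residuals.append(y - level)          # one-step-ahead forecast error
        level = alpha * y + (1 - alpha) * level
    n = max(len(residuals), 1)
    sd = math.sqrt(sum(r * r for r in residuals) / n)
    return level, sd

def interval(forecast, sd, z=1.96):
    """Rough ~95% band around the point forecast."""
    return forecast - z * sd, forecast + z * sd
```

Documenting the band, not just the point forecast, is what lets investment decisions be framed as risk trade-offs.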
Module 3: Performance Baselines and Monitoring Integration
- Configuring monitoring tools to collect capacity-relevant metrics at appropriate granularities (e.g., 5-minute intervals for CPU, daily for storage).
- Distinguishing between performance bottlenecks and capacity constraints using wait-time analysis and queue depth metrics.
- Establishing dynamic baselines that adapt to normal operational variance, reducing false-positive alerts.
- Correlating infrastructure utilization (e.g., memory, I/O) with application-level KPIs to identify inefficient resource consumption.
- Setting up synthetic transaction monitoring to measure end-to-end capacity under controlled load conditions.
- Archiving performance data for at least two business cycles to support trend analysis and audit requirements.
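The dynamic-baseline idea above can be sketched as a rolling mean-and-deviation band: a sample is flagged only when it falls outside k standard deviations of recent behavior, so normal variance does not generate false positives. The window size and k are illustrative tuning knobs.

```python
import math
from collections import deque

class DynamicBaseline:
    """Rolling mean +/- k*stddev band over the last `window` samples."""
    def __init__(self, window=12, k=3.0):
        self.buf = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Record a sample; return True if it lies outside the current band."""
        anomalous = False
        if len(self.buf) == self.buf.maxlen:  # only flag once warmed up
            mean = sum(self.buf) / len(self.buf)
            sd = math.sqrt(sum((x - mean) ** 2 for x in self.buf) / len(self.buf))
            anomalous = abs(value - mean) > self.k * sd + 1e-9  # guard flat baselines
        self.buf.append(value)
        return anomalous
```

Real monitoring tools add seasonality-aware baselines on top of this, but the adapt-to-variance principle is the same.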
Module 4: Infrastructure Sizing and Right-Sizing Strategies
- Calculating required compute capacity using workload benchmarks and vendor-provided performance data, adjusted for virtualization overhead.
- Right-sizing over-provisioned VMs based on utilization trends, considering application memory footprints and burst requirements.
- Evaluating the trade-off between vertical and horizontal scaling for stateful applications with persistent storage dependencies.
- Assessing the impact of container density on node-level contention for CPU, memory, and network bandwidth.
- Planning storage capacity with consideration for growth, retention policies, and backup overhead (e.g., 3x for daily snapshots).
- Documenting sizing assumptions and validation methods for audit and handover to operations teams.
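A right-sizing recommendation of the kind described above can be sketched from a utilization percentile plus a headroom factor: size so that observed P95 demand fits under the post-headroom allocation. The 30% headroom (covering bursts and virtualization overhead) and the P95 choice are illustrative assumptions to be documented alongside the result.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sample list."""
    s = sorted(samples)
    return s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]

def rightsize_vcpus(cpu_pct_samples, current_vcpus, headroom=0.3, min_vcpus=1):
    """Recommend vCPUs so P95 demand fits under (1 - headroom) of the
    allocation. Demand is derived from utilization of the current size."""
    demand_vcpus = p95(cpu_pct_samples) / 100.0 * current_vcpus
    return max(min_vcpus, math.ceil(demand_vcpus / (1.0 - headroom)))
```

Memory would be sized the same way but against the application's resident footprint, which rarely shrinks with load.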
Module 5: Cloud and Hybrid Capacity Management
- Determining optimal reservation models (e.g., Reserved Instances, Savings Plans) based on workload stability and expected commitment duration.
- Designing auto-scaling policies that respond to queue length or request latency, not just CPU utilization.
- Managing cross-region failover capacity requirements, including DNS TTL and data replication lag implications.
- Monitoring egress costs as a capacity constraint in public cloud environments with high data transfer volumes.
- Implementing tagging and chargeback mechanisms to attribute cloud spend to business units for capacity accountability.
- Planning for cloud provider quota limits and request throttling during peak scaling events.
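The queue-driven scaling policy above can be sketched as: provision enough replicas to drain the current backlog within a target time, clamped to min/max bounds and a per-cycle step limit to avoid thrash. All parameter values here are illustrative assumptions, not provider defaults.

```python
import math

def desired_replicas(queue_length, per_replica_tps, target_drain_s,
                     current, min_r=2, max_r=20, max_step=4):
    """Scale on backlog, not CPU: replicas needed to drain `queue_length`
    messages within `target_drain_s` seconds, bounded and rate-limited."""
    needed = math.ceil(queue_length / (per_replica_tps * target_drain_s))
    needed = max(min_r, min(max_r, needed))          # respect fleet bounds
    return max(current - max_step, min(current + max_step, needed))  # limit churn
```

The step limit also keeps a scaling burst inside cloud-provider API quotas, tying back to the throttling point above.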
Module 6: Capacity Governance and Financial Alignment
- Establishing capacity review boards to approve infrastructure changes exceeding predefined utilization or cost thresholds.
- Aligning capacity budgets with fiscal planning cycles and securing multi-year funding for long-lead hardware.
- Defining service level objectives (SLOs) that include capacity headroom targets (e.g., 70% max CPU during peak).
- Negotiating hardware refresh cycles with vendors based on support lifecycle and performance degradation data.
- Conducting quarterly capacity audits to validate alignment between allocated, utilized, and reserved resources.
- Integrating capacity risk assessments into enterprise risk management frameworks for audit compliance.
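A quarterly audit of the kind described above reduces, at its core, to comparing allocated against peak-utilized resources per service and flagging both directions of drift: SLO headroom breaches and chronic over-allocation. The 70% headroom ceiling matches the example above; the 20% under-use floor and the record shape are illustrative assumptions.

```python
def audit_findings(records, max_util_pct=70.0, min_use_of_alloc=0.2):
    """Each record is (service, allocated_cores, peak_used_cores).
    Returns (service, finding) pairs for the review board."""
    findings = []
    for name, allocated, peak_used in records:
        util_pct = 100.0 * peak_used / allocated
        if util_pct > max_util_pct:
            findings.append((name, "headroom-breach"))   # SLO headroom violated
        elif peak_used < min_use_of_alloc * allocated:
            findings.append((name, "over-allocated"))    # candidate for reclaim
    return findings
```

A real audit would add reserved-but-unallocated capacity as a third column and reconcile it against financial commitments.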
Module 7: Scenario Planning and Stress Testing
- Designing load tests that simulate peak business scenarios (e.g., end-of-month processing) using production-like data.
- Executing failover capacity tests to validate standby environment readiness under full production load.
- Modeling the impact of third-party service degradation on internal capacity requirements (e.g., API rate limiting).
- Using chaos engineering techniques to expose hidden capacity dependencies and single points of failure.
- Documenting recovery time objectives (RTO) and recovery point objectives (RPO) under constrained capacity conditions.
- Updating capacity models based on test results, particularly when observed saturation occurs below projected thresholds.
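The last point above, recalibrating when a test saturates early, can be expressed as a simple rule: plan against the lower of the projected and observed saturation points, less a safety margin, rather than trusting the original model. The 10% margin is an illustrative assumption.

```python
def recalibrated_capacity(projected_tps, observed_saturation_tps, margin=0.1):
    """If stress testing saturates below projection, the observed figure
    (minus a safety margin) becomes the planning capacity."""
    effective = min(projected_tps, observed_saturation_tps)
    return effective * (1.0 - margin)
```

The same rule applied after a chaos experiment captures hidden dependencies the analytical model missed.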
Module 8: Continuous Improvement and Automation
- Implementing automated capacity alerts with root cause templates to accelerate investigation workflows.
- Developing scripts to generate monthly capacity reports from monitoring and CMDB data, reducing manual effort.
- Integrating capacity data into incident management systems to correlate outages with resource exhaustion.
- Using machine learning models to detect anomalous usage patterns that may indicate misconfigurations or security incidents.
- Automating VM decommissioning workflows based on sustained low utilization and lack of dependency links.
- Establishing feedback loops between capacity planning and development teams to influence application efficiency during design.
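The automated decommissioning workflow above can be sketched as a filter over CMDB-enriched utilization records: a VM qualifies only when utilization has stayed low for a sustained period and no dependency links point at it. The field names, the 5% CPU low-water mark, and the 90-day quiet period are hypothetical assumptions for illustration.

```python
def decommission_candidates(vms, cpu_threshold_pct=5.0, min_quiet_days=90):
    """Each vm dict: name, days_below_threshold, peak_cpu_pct, dependents.
    Returns names safe to route into the decommissioning workflow."""
    out = []
    for vm in vms:
        quiet = vm["days_below_threshold"] >= min_quiet_days   # sustained, not a lull
        low = vm["peak_cpu_pct"] < cpu_threshold_pct           # even peaks are idle
        unlinked = not vm["dependents"]                        # CMDB shows no consumers
        if quiet and low and unlinked:
            out.append(vm["name"])
    return out
```

Candidates would still pass through an approval gate before deletion; the automation narrows the list, it does not pull the trigger.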