This curriculum spans a multi-workshop capacity management program, covering the technical, operational, and governance practices of mature enterprise environments that run hybrid infrastructure under formal IT service management frameworks.
Module 1: Foundational Principles of Capacity Management
- Selecting between reactive and proactive capacity planning based on historical incident patterns and business tolerance for service degradation.
- Defining service capacity units (e.g., transactions per second, concurrent users) that align with business-critical workloads and technical monitoring capabilities.
- Establishing thresholds for performance degradation that trigger capacity reviews, balancing sensitivity with operational noise.
- Integrating capacity planning into ITIL service lifecycle phases, particularly service design and continual service improvement.
- Mapping application dependencies to infrastructure tiers to identify capacity bottlenecks beyond isolated component metrics.
- Documenting assumptions about growth rates and workload behavior used in long-term capacity forecasts.
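The degradation-threshold idea above can be sketched as a small check: trigger a capacity review only after sustained breaches of a capacity-unit threshold, so transient spikes don't generate operational noise. The capacity unit (TPS), the 80% threshold, and the three-interval streak are illustrative assumptions, not prescriptions.

```python
def review_triggered(samples, rated_tps, threshold=0.8, consecutive=3):
    """Return True once `consecutive` intervals exceed `threshold` of rated
    capacity. Requiring a streak balances sensitivity against alert noise."""
    streak = 0
    for tps in samples:
        if tps / rated_tps >= threshold:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0  # a single quiet interval resets the streak
    return False
```

In practice the threshold and streak length would themselves come from the documented degradation criteria agreed with the business.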
Module 2: Demand Forecasting and Workload Modeling
- Choosing between time-series forecasting models (e.g., ARIMA, exponential smoothing) based on data availability and seasonality patterns.
- Adjusting baseline forecasts for one-time business events such as product launches or marketing campaigns using historical analog data.
- Segmenting user populations by behavior (e.g., peak usage times, transaction volume) to model differentiated demand profiles.
- Validating forecast accuracy quarterly by comparing predicted vs. actual utilization and recalibrating models accordingly.
- Modeling workload elasticity for cloud-native applications, including auto-scaling lag and cold-start impacts.
- Documenting confidence intervals around projections to inform risk-based infrastructure investment decisions.
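As a minimal sketch of the forecasting and confidence-interval points above, simple exponential smoothing can be implemented in a few lines, with a rough prediction interval derived from one-step-ahead residuals. The smoothing constant and the 1.96 z-value are illustrative assumptions; a production model would be selected and recalibrated against actuals as described above.

```python
import math

def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*level.
    Returns (one-step forecast, residual std) for interval construction."""
    level = series[0]
    residuals = []
    for y in series[1:]:
        residuals.append(y - level)          # one-step-ahead forecast error
        level = alpha * y + (1 - alpha) * level
    n = max(len(residuals), 1)
    sd = math.sqrt(sum(r * r for r in residuals) / n)
    return level, sd

def interval(forecast, sd, z=1.96):
    """Rough ~95% band around the point forecast."""
    return forecast - z * sd, forecast + z * sd
```

Documenting the band, not just the point forecast, is what lets investment decisions be framed as risk trade-offs.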
Module 3: Performance Baselines and Monitoring Integration
- Configuring monitoring tools to collect capacity-relevant metrics at appropriate granularities (e.g., 5-minute intervals for CPU, daily for storage).
- Distinguishing between performance bottlenecks and capacity constraints using wait-time analysis and queue depth metrics.
- Establishing dynamic baselines that adapt to normal operational variance, reducing false-positive alerts.
- Correlating infrastructure utilization (e.g., memory, I/O) with application-level KPIs to identify inefficient resource consumption.
- Setting up synthetic transaction monitoring to measure end-to-end capacity under controlled load conditions.
- Archiving performance data for at least two business cycles to support trend analysis and audit requirements.
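The dynamic-baseline idea above can be sketched as a rolling mean-and-deviation band: a sample is flagged only when it falls outside k standard deviations of recent behavior, so normal variance does not generate false positives. The window size and k are illustrative tuning knobs.

```python
import math
from collections import deque

class DynamicBaseline:
    """Rolling mean +/- k*stddev band over the last `window` samples."""
    def __init__(self, window=12, k=3.0):
        self.buf = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Record a sample; return True if it lies outside the current band."""
        anomalous = False
        if len(self.buf) == self.buf.maxlen:  # only flag once warmed up
            mean = sum(self.buf) / len(self.buf)
            sd = math.sqrt(sum((x - mean) ** 2 for x in self.buf) / len(self.buf))
            anomalous = abs(value - mean) > self.k * sd + 1e-9  # guard flat baselines
        self.buf.append(value)
        return anomalous
```

Real monitoring tools add seasonality-aware baselines on top of this, but the adapt-to-variance principle is the same.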
Module 4: Infrastructure Sizing and Right-Sizing Strategies
- Calculating required compute capacity using workload benchmarks and vendor-provided performance data, adjusted for virtualization overhead.
- Right-sizing over-provisioned VMs based on utilization trends, considering application memory footprints and burst requirements.
- Evaluating the trade-off between vertical and horizontal scaling for stateful applications with persistent storage dependencies.
- Assessing the impact of container density on node-level contention for CPU, memory, and network bandwidth.
- Planning storage capacity with consideration for growth, retention policies, and backup overhead (e.g., 3x for daily snapshots).
- Documenting sizing assumptions and validation methods for audit and handover to operations teams.
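A right-sizing recommendation of the kind described above can be sketched from a utilization percentile plus a headroom factor: size so that observed P95 demand fits under the post-headroom allocation. The 30% headroom (covering bursts and virtualization overhead) and the P95 choice are illustrative assumptions to be documented alongside the result.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sample list."""
    s = sorted(samples)
    return s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]

def rightsize_vcpus(cpu_pct_samples, current_vcpus, headroom=0.3, min_vcpus=1):
    """Recommend vCPUs so P95 demand fits under (1 - headroom) of the
    allocation. Demand is derived from utilization of the current size."""
    demand_vcpus = p95(cpu_pct_samples) / 100.0 * current_vcpus
    return max(min_vcpus, math.ceil(demand_vcpus / (1.0 - headroom)))
```

Memory would be sized the same way but against the application's resident footprint, which rarely shrinks with load.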
Module 5: Cloud and Hybrid Capacity Management
- Determining optimal reservation models (e.g., Reserved Instances, Savings Plans) based on workload stability and expected commitment duration.
- Designing auto-scaling policies that respond to queue length or request latency, not just CPU utilization.
- Managing cross-region failover capacity requirements, including DNS TTL and data replication lag implications.
- Monitoring egress costs as a capacity constraint in public cloud environments with high data transfer volumes.
- Implementing tagging and chargeback mechanisms to attribute cloud spend to business units for capacity accountability.
- Planning for cloud provider quota limits and request throttling during peak scaling events.
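The queue-driven scaling policy above can be sketched as: provision enough replicas to drain the current backlog within a target time, clamped to min/max bounds and a per-cycle step limit to avoid thrash. All parameter values here are illustrative assumptions, not provider defaults.

```python
import math

def desired_replicas(queue_length, per_replica_tps, target_drain_s,
                     current, min_r=2, max_r=20, max_step=4):
    """Scale on backlog, not CPU: replicas needed to drain `queue_length`
    messages within `target_drain_s` seconds, bounded and rate-limited."""
    needed = math.ceil(queue_length / (per_replica_tps * target_drain_s))
    needed = max(min_r, min(max_r, needed))          # respect fleet bounds
    return max(current - max_step, min(current + max_step, needed))  # limit churn
```

The step limit also keeps a scaling burst inside cloud-provider API quotas, tying back to the throttling point above.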
Module 6: Capacity Governance and Financial Alignment
- Establishing capacity review boards to approve infrastructure changes exceeding predefined utilization or cost thresholds.
- Aligning capacity budgets with fiscal planning cycles and securing multi-year funding for long-lead hardware.
- Defining service level objectives (SLOs) that include capacity headroom targets (e.g., 70% max CPU during peak).
- Negotiating hardware refresh cycles with vendors based on support lifecycle and performance degradation data.
- Conducting quarterly capacity audits to validate alignment between allocated, utilized, and reserved resources.
- Integrating capacity risk assessments into enterprise risk management frameworks for audit compliance.
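A quarterly audit of the kind described above reduces, at its core, to comparing allocated against peak-utilized resources per service and flagging both directions of drift: SLO headroom breaches and chronic over-allocation. The 70% headroom ceiling matches the example above; the 20% under-use floor and the record shape are illustrative assumptions.

```python
def audit_findings(records, max_util_pct=70.0, min_use_of_alloc=0.2):
    """Each record is (service, allocated_cores, peak_used_cores).
    Returns (service, finding) pairs for the review board."""
    findings = []
    for name, allocated, peak_used in records:
        util_pct = 100.0 * peak_used / allocated
        if util_pct > max_util_pct:
            findings.append((name, "headroom-breach"))   # SLO headroom violated
        elif peak_used < min_use_of_alloc * allocated:
            findings.append((name, "over-allocated"))    # candidate for reclaim
    return findings
```

A real audit would add reserved-but-unallocated capacity as a third column and reconcile it against financial commitments.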
Module 7: Scenario Planning and Stress Testing
- Designing load tests that simulate peak business scenarios (e.g., end-of-month processing) using production-like data.
- Executing failover capacity tests to validate standby environment readiness under full production load.
- Modeling the impact of third-party service degradation on internal capacity requirements (e.g., API rate limiting).
- Using chaos engineering techniques to expose hidden capacity dependencies and single points of failure.
- Documenting recovery time objectives (RTO) and recovery point objectives (RPO) under constrained capacity conditions.
- Updating capacity models based on test results, particularly when observed saturation occurs below projected thresholds.
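The last point above, recalibrating when a test saturates early, can be expressed as a simple rule: plan against the lower of the projected and observed saturation points, less a safety margin, rather than trusting the original model. The 10% margin is an illustrative assumption.

```python
def recalibrated_capacity(projected_tps, observed_saturation_tps, margin=0.1):
    """If stress testing saturates below projection, the observed figure
    (minus a safety margin) becomes the planning capacity."""
    effective = min(projected_tps, observed_saturation_tps)
    return effective * (1.0 - margin)
```

The same rule applied after a chaos experiment captures hidden dependencies the analytical model missed.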
Module 8: Continuous Improvement and Automation
- Implementing automated capacity alerts with root cause templates to accelerate investigation workflows.
- Developing scripts to generate monthly capacity reports from monitoring and CMDB data, reducing manual effort.
- Integrating capacity data into incident management systems to correlate outages with resource exhaustion.
- Using machine learning models to detect anomalous usage patterns that may indicate misconfigurations or security incidents.
- Automating VM decommissioning workflows based on sustained low utilization and lack of dependency links.
- Establishing feedback loops between capacity planning and development teams to influence application efficiency during design.
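The automated decommissioning workflow above can be sketched as a filter over CMDB-enriched utilization records: a VM qualifies only when utilization has stayed low for a sustained period and no dependency links point at it. The field names, the 5% CPU low-water mark, and the 90-day quiet period are hypothetical assumptions for illustration.

```python
def decommission_candidates(vms, cpu_threshold_pct=5.0, min_quiet_days=90):
    """Each vm dict: name, days_below_threshold, peak_cpu_pct, dependents.
    Returns names safe to route into the decommissioning workflow."""
    out = []
    for vm in vms:
        quiet = vm["days_below_threshold"] >= min_quiet_days   # sustained, not a lull
        low = vm["peak_cpu_pct"] < cpu_threshold_pct           # even peaks are idle
        unlinked = not vm["dependents"]                        # CMDB shows no consumers
        if quiet and low and unlinked:
            out.append(vm["name"])
    return out
```

Candidates would still pass through an approval gate before deletion; the automation narrows the list, it does not pull the trigger.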