This curriculum covers the technical, operational, and governance dimensions of capacity planning. It is comparable in scope to a multi-workshop program run within an enterprise’s internal reliability engineering function, and it addresses real-world decisions ranging from infrastructure sizing to incident response.
Module 1: Defining Capacity Requirements and Demand Forecasting
- Selecting between time-series forecasting models and regression-based demand projections based on historical data availability and business volatility.
- Establishing service-level thresholds that trigger capacity planning reviews, such as sustained CPU utilization above 75% for 14 consecutive days.
- Integrating input from sales, product, and finance teams to align capacity forecasts with projected customer growth and feature launches.
- Deciding whether to use peak, average, or percentile-based metrics (e.g., 95th percentile) when sizing infrastructure needs.
- Adjusting forecast models to account for seasonality, such as end-of-quarter surges or holiday traffic in e-commerce platforms.
- Documenting assumptions in demand models to enable auditability and recalibration during post-mortems or capacity incidents.
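As a rough illustration of two items in this module, the sketch below computes a nearest-rank percentile for sizing and checks the "above 75% for 14 consecutive days" review trigger. The function names, the nearest-rank method, and the input shape (one CPU utilization value per day) are assumptions made for the example, not part of the curriculum.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def needs_capacity_review(daily_cpu_pct, threshold=75.0, required_days=14):
    """Return True if utilization exceeded `threshold` on `required_days` consecutive days."""
    streak = 0
    for value in daily_cpu_pct:
        # Any day at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= required_days:
            return True
    return False
```

Sizing to the 95th percentile rather than the peak accepts that the top 5% of samples may exceed provisioned capacity, so the choice depends on how much short-lived saturation the service can tolerate.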
Module 2: Infrastructure Sizing and Resource Allocation
- Choosing between vertical and horizontal scaling strategies based on application architecture and failover requirements.
- Calculating redundancy requirements for high-availability systems, including active-passive versus active-active configurations.
- Allocating reserved versus on-demand compute instances based on workload predictability and cost tolerance.
- Determining memory-to-CPU ratios for database workloads using query execution patterns and buffer pool requirements.
- Right-sizing storage tiers (SSD vs. HDD) based on IOPS requirements and data access frequency.
- Validating virtual machine or container density limits to prevent noisy neighbor issues in shared environments.
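The redundancy calculation above can be sketched for an active-active cluster as "enough nodes to carry peak load at a target utilization, plus spares for tolerated failures". The function name, the 75% default utilization target, and the N+k model are illustrative assumptions; an active-passive design with a full standby would instead need to mirror the base node count.

```python
import math


def nodes_required(peak_load, per_node_capacity, target_utilization=0.75, tolerated_failures=1):
    """Estimate node count for an active-active cluster (hypothetical N+k model).

    peak_load and per_node_capacity must use the same unit, e.g. requests per second.
    """
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_per_node = per_node_capacity * target_utilization
    base_nodes = math.ceil(peak_load / usable_per_node)
    # Spare nodes keep the survivors at or below the target after failures.
    return base_nodes + tolerated_failures
```

For example, a 9,000 requests-per-second peak on nodes rated for 2,000 requests per second at a 75% target needs six nodes for load and a seventh to survive one failure.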
Module 3: Performance Baselines and Monitoring Integration
- Defining baseline performance metrics during normal operations to detect capacity degradation over time.
- Configuring monitoring thresholds that distinguish between transient spikes and sustained capacity pressure.
- Integrating capacity metrics into existing observability platforms without overloading data ingestion pipelines.
- Selecting key performance indicators (KPIs) per system tier, such as queue depth for message brokers or p99 latency for APIs.
- Automating baseline recalibration after major system changes, such as software upgrades or architectural refactoring.
- Correlating capacity metrics with business events (e.g., marketing campaigns) to improve predictive accuracy.
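One possible way to separate transient spikes from sustained pressure is to alert only when most samples in a sliding window breach the threshold. The class below is a minimal sketch of that idea; the class name, window size, and breach count are hypothetical and would need tuning against real baselines.

```python
from collections import deque


class SustainedPressureDetector:
    """Flags pressure only when enough recent samples exceed the threshold."""

    def __init__(self, threshold, window, min_breaches):
        self.threshold = threshold
        self.window = window
        self.min_breaches = min_breaches
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record one sample; return True if the full window shows sustained pressure."""
        self.samples.append(value > self.threshold)
        window_full = len(self.samples) == self.window
        return window_full and sum(self.samples) >= self.min_breaches
```

With a window of 5 and a minimum of 4 breaches, a single spike to 95% surrounded by 70% readings never fires, while four breaches within the last five samples do.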
Module 4: Capacity Modeling and Scenario Simulation
- Building what-if models to evaluate the impact of doubling user load on current database connection pools.
- Simulating failure scenarios to assess spare capacity availability in remaining nodes during outages.
- Using load testing results to validate model assumptions before committing to hardware procurement.
- Modeling the effect of data retention policies on storage growth and backup window expansion.
- Comparing cost and performance trade-offs between cloud bursting and permanent on-premises expansion.
- Updating simulation parameters quarterly to reflect changes in application efficiency or user behavior.
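A simple what-if model for the connection-pool question can be built on Little's law, which estimates average concurrency as arrival rate multiplied by the time each request holds a resource. The sketch below assumes steady-state averages and hypothetical parameter names; it does not capture bursts or queueing delay, so load-test results should still be used to validate it.

```python
def connections_in_use(requests_per_second, mean_hold_ms):
    """Little's law: average busy connections = arrival rate x mean hold time."""
    return requests_per_second * mean_hold_ms / 1000


def pool_headroom(pool_size, requests_per_second, mean_hold_ms, load_multiplier=1):
    """Connections left over (negative means the pool is undersized) at scaled load."""
    scaled_rate = requests_per_second * load_multiplier
    return pool_size - connections_in_use(scaled_rate, mean_hold_ms)


# What-if: a 100-connection pool at 1,200 req/s with a 50 ms mean hold time.
current = pool_headroom(100, 1200, 50)                      # 40.0 spare connections
doubled = pool_headroom(100, 1200, 50, load_multiplier=2)   # -20.0, pool exhausted
```

The model also assumes that the mean hold time stays constant as load doubles, which tends to be optimistic because database latency usually rises under contention.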
Module 5: Governance and Change Control in Capacity Decisions
- Requiring capacity impact assessments for all change requests involving high-resource services.
- Establishing approval workflows for capacity expansions that exceed predefined budget or risk thresholds.
- Documenting capacity decisions in a centralized repository to support audit and compliance requirements.
- Enforcing capacity review gates in the release management process for major application deployments.
- Assigning ownership for capacity health at the service or application level to ensure accountability.
- Coordinating capacity change schedules with maintenance windows to minimize operational disruption.
Module 6: Cloud and Hybrid Environment Capacity Strategies
- Implementing auto-scaling policies with cooldown periods to prevent thrashing during traffic oscillations.
- Managing cross-region capacity dependencies in multi-cloud architectures to avoid single points of failure.
- Tracking reserved instance utilization to avoid paying for unused commitments and to optimize renewal cycles.
- Designing egress cost controls in cloud environments where data transfer impacts capacity economics.
- Aligning cloud provider quotas with projected needs and initiating increase requests before constraints impact operations.
- Integrating cloud cost APIs into capacity dashboards to expose financial implications of resource decisions.
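The cooldown idea behind the auto-scaling item can be sketched as a small controller that ignores new scaling signals until a fixed period has passed since the last change. This is an illustrative model only: the class name, the one-replica step size, and the thresholds are assumptions, and real cloud auto-scalers expose their own policy settings.

```python
class CooldownScaler:
    """Step scaler that refuses to change replica count during a cooldown period."""

    def __init__(self, min_replicas, max_replicas, scale_up_at, scale_down_at, cooldown_seconds):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.scale_up_at = scale_up_at
        self.scale_down_at = scale_down_at
        self.cooldown_seconds = cooldown_seconds
        self.replicas = min_replicas
        self.last_change = None  # timestamp (seconds) of the last scaling action

    def decide(self, utilization, now):
        """Process one utilization sample taken at time `now`; return the replica count."""
        if self.last_change is not None and now - self.last_change < self.cooldown_seconds:
            return self.replicas  # still cooling down, ignore the signal
        target = self.replicas
        if utilization > self.scale_up_at:
            target = min(self.max_replicas, self.replicas + 1)
        elif utilization < self.scale_down_at:
            target = max(self.min_replicas, self.replicas - 1)
        if target != self.replicas:
            self.replicas = target
            self.last_change = now
        return self.replicas
```

Keeping a gap between the scale-up and scale-down thresholds adds hysteresis on top of the cooldown, which further reduces oscillation when utilization hovers near a single threshold.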
Module 7: Capacity Optimization and Right-Sizing Initiatives
- Conducting quarterly resource utilization reviews to identify and decommission underused instances.
- Applying container resource limits and requests based on observed usage, not default configurations.
- Renegotiating data center power and cooling SLAs when consolidating or retiring physical servers.
- Implementing database archiving strategies to reduce active dataset size and improve query performance.
- Using A/B testing to validate performance impact after downsizing over-provisioned systems.
- Standardizing instance types across environments to simplify forecasting and reduce management overhead.
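Deriving container requests and limits from observed usage might look like the following sketch, which sets the request at a high percentile of samples and the limit at the observed maximum plus headroom. The function name, the p95 choice, and the 20% headroom are assumptions for illustration; the samples are assumed to be CPU usage in millicores.

```python
import math


def recommend_cpu(observed_millicores, request_pct=95, limit_headroom_pct=20):
    """Suggest a CPU request and limit (in millicores) from observed usage samples."""
    if not observed_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(observed_millicores)
    # Nearest-rank percentile for the request.
    rank = max(1, math.ceil(request_pct * len(ordered) / 100))
    request = ordered[rank - 1]
    # Limit: observed peak plus a headroom margin, rounded up.
    limit = math.ceil(ordered[-1] * (100 + limit_headroom_pct) / 100)
    return {"request_m": request, "limit_m": limit}
```

The sample window should cover at least one full business cycle; otherwise a quiet week can produce a request that is too low for month-end or seasonal load.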
Module 8: Incident Response and Capacity-Related Outages
- Activating pre-defined surge capacity protocols during unexpected traffic spikes or denial-of-service events.
- Executing failover to secondary systems when primary capacity thresholds are breached.
- Documenting the root cause of capacity-related incidents to prevent recurrence through design changes.
- Temporarily throttling non-critical services to preserve capacity for core business functions.
- Engaging procurement teams to secure emergency hardware or cloud credits when expansion timelines are compressed.
- Conducting blameless post-mortems to evaluate whether monitoring, forecasting, or governance gaps contributed to the incident.
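Throttling non-critical services during an incident can be modeled as priority-ordered allocation of the capacity that remains. The sketch below is one simple approach under assumed inputs: each service is a hypothetical `(name, priority, demand)` tuple, a lower priority number means more critical, and demand and capacity share one unit.

```python
def shed_load(services, available_capacity):
    """Grant capacity to the most critical services first; the rest are throttled.

    Returns a dict mapping service name to granted capacity.
    """
    allocation = {}
    remaining = available_capacity
    for name, _priority, demand in sorted(services, key=lambda s: s[1]):
        granted = min(demand, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


services = [("recommendations", 3, 40), ("checkout", 1, 50), ("search", 2, 30)]
plan = shed_load(services, 100)
# checkout and search are served in full; recommendations is cut to 20 of 40.
```

Agreeing on the priority order before an incident, rather than during one, is what makes a protocol like this usable under pressure.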