This curriculum spans the technical, operational, and organizational dimensions of cloud capacity planning. In scope it is equivalent to a multi-workshop operational readiness program run alongside a cloud migration initiative, covering measurement, modeling, automation, governance, and cross-team coordination across hybrid environments.
Module 1: Assessing Current Workloads and Baseline Performance
- Decide which legacy application metrics (CPU, memory, I/O, network) to collect based on business criticality and cloud migration priority.
- Implement automated data collection from on-prem monitoring tools (e.g., Nagios, Zabbix) to build historical performance baselines.
- Determine the duration of performance data retention required for seasonal trend analysis versus cost of storage in the target cloud.
- Classify workloads by volatility (steady-state vs. bursty) to inform autoscaling design and instance selection.
- Identify dependencies between applications and databases using network flow analysis to avoid under-provisioning in interdependent systems.
- Negotiate access to production environment logs with security and operations teams while complying with change control policies.
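The volatility classification above can be sketched with a coefficient-of-variation check on collected utilization samples. This is a minimal illustration: the 0.35 cut-off and the sample data are assumptions to calibrate against your own baselines, not a standard.

```python
import statistics

def classify_volatility(cpu_samples, cv_threshold=0.35):
    """Classify a workload as steady-state or bursty using the
    coefficient of variation (stdev / mean) of its CPU samples.
    The 0.35 threshold is an illustrative starting point."""
    mean = statistics.mean(cpu_samples)
    if mean == 0:
        return "idle"
    cv = statistics.stdev(cpu_samples) / mean
    return "bursty" if cv > cv_threshold else "steady-state"

steady = [62, 58, 65, 60, 63, 59, 61, 64]  # e.g., a constant-load batch job
spiky = [5, 8, 90, 6, 85, 7, 95, 10]       # e.g., a traffic-driven frontend
print(classify_volatility(steady))  # steady-state
print(classify_volatility(spiky))   # bursty
```

Steady workloads classified this way are candidates for fixed sizing or reservations; bursty ones for autoscaling groups with headroom.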
Module 2: Cloud Sizing and Instance Selection Strategies
- Select between general-purpose, compute-optimized, or memory-optimized instance families based on application profiling data.
- Compare vCPU-to-memory ratios across AWS EC2, Azure VMs, and GCP Compute Engine to match workload requirements.
- Decide whether to use standardized instance types across environments for operational consistency or optimize per workload.
- Implement right-sizing recommendations using cloud provider tools (e.g., AWS Compute Optimizer) and validate with load testing.
- Balance the risk of over-provisioning (cost) versus under-provisioning (performance degradation) during initial migration.
- Document instance selection rationale for audit and governance purposes, especially for regulated workloads.
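A first-pass family recommendation can be derived from the profiled memory-per-vCPU ratio, as sketched below. The family labels and ratio cut-offs are illustrative assumptions; actual ratios (e.g., AWS c-, m-, and r-series at roughly 2, 4, and 8 GiB per vCPU) vary by provider and generation, so validate against current instance catalogs and load tests.

```python
def recommend_family(peak_vcpus, peak_mem_gib):
    """Map a profiled workload to a generic instance family by its
    memory-per-vCPU ratio. Cut-offs are illustrative, not provider specs."""
    ratio = peak_mem_gib / peak_vcpus
    if ratio < 3:
        return "compute-optimized"  # ~2 GiB per vCPU
    if ratio <= 5:
        return "general-purpose"    # ~4 GiB per vCPU
    return "memory-optimized"       # ~8 GiB per vCPU

print(recommend_family(8, 16))  # compute-optimized
print(recommend_family(4, 16))  # general-purpose
print(recommend_family(4, 32))  # memory-optimized
```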
Module 3: Demand Forecasting and Scalability Modeling
- Apply time-series forecasting (e.g., exponential smoothing) to historical usage data to project 6- and 12-month capacity needs.
- Model peak load scenarios (e.g., end-of-month processing, marketing campaigns) using stress test results and business calendars.
- Integrate business growth projections from finance teams into capacity models, adjusting for product launch timelines.
- Choose between predictive (forecast-based) and reactive (metric-driven) scaling strategies for different application tiers.
- Validate forecast accuracy quarterly by comparing predicted vs. actual resource consumption and recalibrating models.
- Define thresholds for scaling events that avoid thrashing while maintaining SLA compliance during traffic spikes.
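Because flat single-exponential smoothing cannot project growth, a trend-aware variant such as Holt's linear method is a reasonable sketch for the 6- and 12-month projections above. The smoothing constants and the monthly history below are illustrative assumptions; recalibrate against held-out actuals as described in the quarterly validation step.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Holt's linear-trend exponential smoothing: maintain a level and a
    trend component, then extrapolate `horizon` periods ahead.
    alpha/beta are illustrative defaults, not fitted values."""
    level, trend = series[0], series[1] - series[0]
    for actual in series[1:]:
        prev_level = level
        level = alpha * actual + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (i + 1) * trend for i in range(horizon)]

# Hypothetical monthly peak vCPU-hours showing steady growth:
history = [100, 108, 115, 124, 131, 140]
projection = holt_forecast(history, horizon=6)
print([round(x) for x in projection])
```

Comparing each quarter's projection against actual consumption gives the recalibration signal called for above.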
Module 4: Designing for Elasticity and Autoscaling
- Configure horizontal pod autoscalers (HPA) in Kubernetes based on custom metrics like requests per second, not just CPU.
- Set cooldown periods and scaling step sizes to prevent rapid scale-in after scale-out during transient load spikes.
- Implement predictive scaling using AWS Auto Scaling plans or Azure Autoscale rules with scheduled actions.
- Design warm-up procedures for stateful applications to ensure new instances are fully operational before receiving traffic.
- Test autoscaling behavior under failure conditions (e.g., AZ outage) to ensure capacity rebalancing works as intended.
- Monitor scaling event logs to identify patterns of unnecessary scaling and refine trigger conditions.
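The cooldown behavior described above can be illustrated with a toy scaler: after any scaling action, further actions are suppressed for a fixed number of ticks, so a transient spike does not trigger an immediate scale-out/scale-in oscillation. Thresholds, cooldown length, and the load trace are assumptions for illustration only.

```python
class Autoscaler:
    """Toy scaler demonstrating cooldown-based thrash prevention.
    All parameters are illustrative defaults."""
    def __init__(self, replicas=2, out_at=70, in_at=30, cooldown=3):
        self.replicas = replicas
        self.out_at, self.in_at = out_at, in_at
        self.cooldown = cooldown
        self.last_action = -10**9  # no action yet

    def observe(self, tick, cpu_pct):
        if tick - self.last_action < self.cooldown:
            return  # still cooling down; ignore this sample
        if cpu_pct > self.out_at:
            self.replicas += 1
            self.last_action = tick
        elif cpu_pct < self.in_at and self.replicas > 1:
            self.replicas -= 1
            self.last_action = tick

scaler = Autoscaler()
# Transient spike at t=0, then low load: the scaler adds one replica,
# holds through the cooldown, and only then scales back in.
for tick, cpu in enumerate([95, 20, 20, 20, 20]):
    scaler.observe(tick, cpu)
print(scaler.replicas)  # 2
```

Tuning the cooldown and step size against real scaling-event logs is the refinement loop the last bullet describes.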
Module 5: Storage and Data Tiering Optimization
- Map application IOPS and latency requirements to cloud storage classes (e.g., gp3 vs. io2 on AWS).
- Implement lifecycle policies to automatically transition infrequently accessed data to lower-cost tiers (e.g., S3 Standard-IA).
- Size database volumes with growth projections, including overhead for snapshots and transaction logs.
- Decide between provisioned and burstable storage based on predictable vs. variable I/O patterns.
- Configure read replicas and caching layers to reduce load on primary databases and optimize storage performance.
- Monitor storage utilization trends to trigger volume resizing or migration before performance bottlenecks occur.
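The volume-sizing bullet above can be made concrete with compound growth plus a headroom allowance, as in this sketch. The growth rate, 25% snapshot/log overhead, and 50 GiB provisioning increment are illustrative assumptions.

```python
import math

def size_volume_gib(current_gib, monthly_growth_pct, months,
                    overhead_pct=25, round_to=50):
    """Project a database volume with compound monthly growth, add
    headroom for snapshots and transaction logs, and round up to a
    provisioning increment. Percentages are illustrative assumptions."""
    projected = current_gib * (1 + monthly_growth_pct / 100) ** months
    with_overhead = projected * (1 + overhead_pct / 100)
    return math.ceil(with_overhead / round_to) * round_to

# A 500 GiB database growing 3% per month, sized 12 months out:
print(size_volume_gib(500, 3, 12))  # 900
```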
Module 6: Cost-Aware Capacity Governance
- Implement tagging policies to allocate cloud compute and storage costs to business units and projects.
- Use reserved instances or savings plans for predictable, steady-state workloads after analyzing utilization history.
- Set budget alerts and automated shutdown policies for non-production environments to prevent idle resource waste.
- Conduct monthly cost reviews to identify underutilized resources and enforce decommissioning processes.
- Negotiate enterprise discount agreements with cloud providers based on projected multi-year usage commitments.
- Balance the use of spot instances for fault-tolerant workloads against the risk of instance termination during capacity shortages.
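The reserved-versus-on-demand decision above reduces to a break-even utilization check: a reservation priced at X% of the on-demand rate pays off only when the instance runs more than X% of the time. The hourly rates below are illustrative; real pricing varies by term, payment option, region, and instance type.

```python
def breakeven_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of hours an instance must run for a reservation
    to beat on-demand pricing."""
    return reserved_hourly / on_demand_hourly

def cheaper_option(on_demand_hourly, reserved_hourly, expected_utilization):
    threshold = breakeven_utilization(on_demand_hourly, reserved_hourly)
    return "reserved" if expected_utilization > threshold else "on-demand"

# A reservation at 60% of the on-demand rate pays off above 60% utilization:
print(cheaper_option(0.10, 0.06, 0.90))  # reserved
print(cheaper_option(0.10, 0.06, 0.40))  # on-demand
```

Running this check against actual utilization history, per the bullet above, keeps commitments aligned with steady-state demand.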
Module 7: Monitoring, Feedback Loops, and Continuous Adjustment
- Deploy monitoring agents to collect real-time performance data across hybrid environments (on-prem and cloud).
- Define key capacity health indicators (e.g., average CPU utilization, scaling event frequency) for executive reporting.
- Integrate capacity metrics into incident management systems to correlate performance issues with resource constraints.
- Establish quarterly capacity review meetings with application owners to validate forecast assumptions and adjust plans.
- Automate anomaly detection using machine learning tools (e.g., Amazon CloudWatch Anomaly Detection) to flag unexpected usage.
- Update capacity models based on architectural changes, such as containerization or microservices adoption.
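The anomaly-detection bullet above can be approximated with a rolling z-score check, as a simple stand-in for managed services such as Amazon CloudWatch Anomaly Detection. The window size, threshold, and usage trace are illustrative assumptions.

```python
import statistics
from collections import deque

def detect_anomalies(samples, window=12, z_threshold=3.0):
    """Flag samples more than z_threshold standard deviations from the
    rolling mean of the previous `window` samples. Parameters are
    illustrative, not tuned values."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(samples):
        if len(recent) == recent.maxlen:
            mean = statistics.mean(recent)
            stdev = statistics.stdev(recent)
            if stdev > 0 and abs(x - mean) / stdev > z_threshold:
                anomalies.append(i)
        recent.append(x)
    return anomalies

# Steady CPU utilization with one unexpected spike at index 12:
usage = [50, 52, 49, 51, 50, 53, 48, 52, 50, 51, 49, 52, 180, 51, 50]
print(detect_anomalies(usage))  # [12]
```

Flagged indices would feed the incident-management correlation described above; a managed service replaces the static threshold with a learned band.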
Module 8: Cross-Functional Alignment and Change Management
- Coordinate with procurement to align cloud capacity planning with fiscal budget cycles and approval workflows.
- Engage application development teams early to understand upcoming feature releases that may impact resource demand.
- Define escalation paths for capacity emergencies, such as unexpected traffic surges requiring immediate provisioning.
- Document capacity planning decisions in architecture review boards (ARBs) to ensure consistency across projects.
- Train operations teams on interpreting capacity dashboards and executing predefined scaling playbooks.
- Align capacity KPIs with business outcomes (e.g., transaction throughput, user response time) to maintain stakeholder focus.