This curriculum spans the technical and organizational practices found in multi-workshop capacity governance programs: demand forecasting, infrastructure modeling, cloud cost controls, performance baselining, and resilience planning across hybrid environments.
Module 1: Defining Capacity Requirements and Demand Forecasting
- Selecting time-series forecasting models (e.g., exponential smoothing vs. ARIMA) based on data availability and historical volatility in transaction volumes.
- Integrating business growth projections from finance teams into IT capacity models while accounting for mergers, product launches, or market expansions.
- Establishing thresholds for acceptable forecast error and defining recalibration cycles for predictive models.
- Mapping business service tiers to transaction volume expectations and peak concurrency requirements.
- Deciding between centralized and decentralized demand collection processes across business units.
- Handling seasonality and one-off events (e.g., end-of-quarter reporting spikes) in baseline capacity models.
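The forecasting and error-threshold topics above can be sketched in a few lines. This is a minimal illustration of simple exponential smoothing with a MAPE-based recalibration check, not a recommended model; the alpha value, demand series, and 10% error threshold are all hypothetical.

```python
def ses_forecast(series, alpha=0.3):
    """One-step-ahead simple exponential smoothing forecasts.

    Returns one forecast per observation; the final entry is the
    forecast for the next, unobserved period.
    """
    level = series[0]           # seed the level with the first observation
    forecasts = []
    for actual in series[1:]:
        forecasts.append(level)                      # forecast made before seeing `actual`
        level = alpha * actual + (1 - alpha) * level # update the smoothed level
    forecasts.append(level)                          # next-period forecast
    return forecasts

def mape(actuals, forecasts):
    """Mean absolute percentage error, usable as a recalibration trigger."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals) * 100

# Hypothetical monthly transaction volumes (thousands).
volumes = [100, 110, 105, 120, 130, 125, 140]
forecasts = ses_forecast(volumes, alpha=0.3)
error_pct = mape(volumes[1:], forecasts[:-1])  # compare forecasts to what actually happened
needs_recalibration = error_pct > 10.0         # illustrative error threshold
```

In practice a model like ARIMA would be fit with a library such as statsmodels; the point here is only the shape of the forecast-then-measure-error loop that drives recalibration cycles.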
Module 2: Infrastructure Sizing and Resource Modeling
- Calculating CPU, memory, and I/O requirements for virtualized workloads using performance benchmarks from pilot deployments.
- Choosing between overprovisioning and dynamic scaling strategies for cloud-hosted applications based on cost and performance SLAs.
- Modeling storage growth for structured and unstructured data with retention policies and compression ratios.
- Assessing the impact of container density on node-level resource contention in Kubernetes clusters.
- Validating sizing assumptions through load testing with production-like data sets and user behavior patterns.
- Adjusting resource allocation models to account for software bloat or inefficiencies in legacy applications.
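A sizing calculation like those above often reduces to "how many nodes fit the peak demand with headroom to spare." The sketch below assumes hypothetical node shapes and a 30% headroom target; real sizing would also account for I/O, container density limits, and failure-domain spares.

```python
import math

def nodes_needed(total_vcpus, total_mem_gib, node_vcpus, node_mem_gib, headroom=0.3):
    """Nodes required so peak demand fits within (1 - headroom) of capacity.

    Takes the max of the CPU-driven and memory-driven counts, since the
    tighter dimension determines the real node requirement.
    """
    usable_cpu = node_vcpus * (1 - headroom)
    usable_mem = node_mem_gib * (1 - headroom)
    return max(math.ceil(total_vcpus / usable_cpu),
               math.ceil(total_mem_gib / usable_mem))

# Hypothetical peak workload: 180 vCPUs and 720 GiB RAM,
# on nodes with 32 vCPUs and 128 GiB each.
count = nodes_needed(180, 720, 32, 128)
```

Keeping CPU and memory as separate constraints makes it obvious when a workload is memory-bound and a different node shape would waste fewer cores.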
Module 3: Cloud and Hybrid Capacity Management
- Designing auto-scaling policies that balance cost, latency, and instance warm-up time across AWS, Azure, or GCP.
- Allocating reserved instances versus on-demand instances based on predictable usage patterns and discount break-even analysis.
- Monitoring cross-region data transfer costs and egress fees when distributing capacity across availability zones.
- Implementing tagging and chargeback frameworks to attribute cloud spend to business units accurately.
- Managing cold start risks in serverless environments during sudden traffic surges.
- Enforcing capacity quotas and approval workflows to prevent uncontrolled resource provisioning in self-service cloud platforms.
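The reserved-versus-on-demand decision above is, at its core, a break-even calculation. The rates below are invented for illustration; real pricing varies by region, instance family, and commitment term.

```python
def breakeven_hours(on_demand_rate, reserved_annual_cost):
    """Hours per year above which a 1-year reservation beats on-demand pricing."""
    return reserved_annual_cost / on_demand_rate

def choose_pricing(expected_hours_per_year, on_demand_rate, reserved_annual_cost):
    """Pick the cheaper pricing model for a predicted annual usage level."""
    if expected_hours_per_year > breakeven_hours(on_demand_rate, reserved_annual_cost):
        return "reserved"
    return "on-demand"

# Hypothetical rates: $0.10/hr on-demand vs. $526/yr reserved
# -> break-even at 5,260 hours (about 60% annual utilization).
always_on = choose_pricing(8760, 0.10, 526)   # 24/7 workload
bursty    = choose_pricing(2000, 0.10, 526)   # intermittent workload
```

The same comparison extends to savings plans or 3-year terms by swapping in the appropriate committed cost and horizon.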
Module 4: Performance Monitoring and Baseline Establishment
- Selecting key performance indicators (KPIs) such as response time, throughput, and error rates for critical business transactions.
- Deploying synthetic transaction monitoring to detect performance degradation before user impact.
- Establishing performance baselines during normal operations to detect anomalies and plan for growth.
- Integrating monitoring tools (e.g., Prometheus, Datadog) with capacity planning databases for trend analysis.
- Filtering out noise from monitoring data caused by batch jobs, backups, or maintenance tasks.
- Defining alert thresholds that trigger capacity reviews without generating excessive false positives.
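The baseline-and-threshold ideas above can be shown with a simple z-score check: characterize normal operations, then flag values far outside that band. The sample latencies and the 3-sigma threshold are illustrative; production baselining would use percentiles over much longer, noise-filtered windows.

```python
import statistics

def baseline(samples):
    """Mean and standard deviation of a metric over a known-clean window."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value, mean, stdev, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from baseline."""
    return abs(value - mean) > z_threshold * stdev

# Hypothetical response times (ms) collected during normal operations,
# after filtering out batch-job and backup windows.
latencies = [200, 210, 195, 205, 198, 202]
mean, stdev = baseline(latencies)

spike_alert  = is_anomalous(400, mean, stdev)  # should trip a capacity review
normal_value = is_anomalous(205, mean, stdev)  # should not
```

Raising `z_threshold` trades sensitivity for fewer false positives, which is exactly the tuning decision the alert-threshold bullet describes.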
Module 5: Capacity Governance and Decision Frameworks
- Creating a capacity review board with representation from infrastructure, application, and business teams to prioritize investments.
- Documenting capacity decisions and assumptions in a centralized repository for audit and continuity purposes.
- Setting escalation paths for capacity shortfalls that threaten service level objectives (SLOs).
- Aligning capacity refresh cycles with hardware end-of-life and software support timelines.
- Enforcing standard templates for capacity requests to ensure consistent evaluation across departments.
- Balancing short-term tactical fixes (e.g., vertical scaling) against long-term architectural improvements.
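Enforcing a standard request template is straightforward to automate. The field names below are hypothetical placeholders for whatever a given review board actually requires; the pattern is simply "reject requests with missing fields before they reach evaluation."

```python
# Illustrative template fields; a real board would define its own.
REQUIRED_FIELDS = {
    "service",
    "business_unit",
    "requested_vcpus",
    "requested_mem_gib",
    "justification",
    "forecast_horizon_months",
}

def missing_fields(request):
    """Return the template fields absent from a capacity request dict."""
    return REQUIRED_FIELDS - request.keys()

incomplete = missing_fields({"service": "billing", "requested_vcpus": 16})
complete = missing_fields({
    "service": "billing", "business_unit": "finance",
    "requested_vcpus": 16, "requested_mem_gib": 64,
    "justification": "Q3 volume growth", "forecast_horizon_months": 12,
})
```

A gate like this in the request intake path is what makes evaluations consistent across departments: every request arrives with the same decision-relevant data.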
Module 6: Scalability Architecture and Design Patterns
- Choosing between vertical and horizontal scaling based on application statefulness and database constraints.
- Implementing read replicas and sharding strategies to distribute database load under high query volumes.
- Evaluating message queue backlogs as indicators of downstream processing bottlenecks.
- Designing stateless application layers to enable seamless horizontal scaling and failover.
- Assessing the impact of caching strategies (in-memory vs. distributed) on memory capacity and data consistency.
- Integrating circuit breakers and rate limiting to prevent cascading failures during capacity saturation.
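The circuit-breaker bullet can be made concrete with a minimal sketch: open after a run of consecutive failures, reject calls while open, and allow a trial call after a cooldown. The failure count and reset window are arbitrary; production systems would use a hardened library rather than this hand-rolled version.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, rejects calls while open, and half-opens after `reset_after`
    seconds to let a single trial call through."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # any success closes the breaker
        return result
```

Rejecting calls immediately while open is what prevents a saturated downstream dependency from dragging its callers into the same failure, which is the cascading-failure scenario the bullet above targets.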
Module 7: Risk Management and Contingency Planning
- Conducting stress tests to identify breaking points and define safe operating limits for production systems.
- Developing surge capacity plans for disaster recovery scenarios involving workload failover.
- Quantifying the risk of under-provisioning using historical incident data and outage cost estimates.
- Establishing pre-approved budget and procurement pathways for emergency capacity acquisition.
- Documenting fallback procedures when auto-scaling fails to keep pace with demand spikes.
- Reviewing third-party dependency capacity (e.g., APIs, CDNs) as part of end-to-end service resilience planning.
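Quantifying under-provisioning risk, as the bullet above describes, often comes down to an expected-loss comparison. The incident counts and dollar figures below are invented; real inputs would come from historical incident data and business-side outage cost estimates.

```python
def expected_annual_loss(incidents_per_year, avg_outage_hours, cost_per_hour):
    """Expected yearly cost of capacity-related outages, from historical data."""
    return incidents_per_year * avg_outage_hours * cost_per_hour

def surge_buffer_justified(buffer_annual_cost, loss_without, loss_with):
    """A surge-capacity buffer pays for itself if the loss it averts
    exceeds what the buffer costs to keep on standby."""
    return (loss_without - loss_with) > buffer_annual_cost

# Hypothetical scenario: 4 outages/yr averaging 2 hours at $50k/hr,
# reduced to an expected $100k/yr with a surge buffer costing $150k/yr.
loss_now = expected_annual_loss(4, 2, 50_000)
worth_it = surge_buffer_justified(150_000, loss_now, 100_000)
```

Framing the decision this way gives the capacity review board a defensible, auditable number rather than an intuition about how much headroom "feels" safe.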
Module 8: Continuous Improvement and Optimization
- Conducting post-incident reviews after capacity-related outages to update forecasting models and thresholds.
- Reclaiming underutilized resources through rightsizing initiatives and decommissioning idle instances.
- Integrating capacity feedback loops into CI/CD pipelines to assess performance impact of new releases.
- Using utilization heatmaps to identify opportunities for workload consolidation or migration.
- Updating capacity models quarterly based on actual usage trends and business trajectory changes.
- Benchmarking capacity efficiency metrics (e.g., cost per transaction, utilization rates) across peer systems.
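The rightsizing bullet above amounts to a filter over utilization data: find instances whose peak usage sits well below what they are paying for. The instance names, utilization figures, and thresholds here are all illustrative.

```python
def rightsizing_candidates(utilization, cpu_threshold=0.2, mem_threshold=0.3):
    """Instances whose peak CPU and memory both sit below the thresholds.

    `utilization` maps instance name -> (peak_cpu_fraction, peak_mem_fraction).
    """
    return [name for name, (cpu, mem) in utilization.items()
            if cpu < cpu_threshold and mem < mem_threshold]

# Hypothetical peak utilization over a 30-day window.
peaks = {
    "web-1":   (0.65, 0.70),  # well used, leave alone
    "batch-2": (0.10, 0.15),  # candidate for a smaller instance
    "idle-3":  (0.05, 0.08),  # candidate for decommissioning
}
candidates = rightsizing_candidates(peaks)
```

Using peak rather than average utilization is the conservative choice: an instance that averages 5% but spikes to 90% during end-of-quarter processing is not a safe rightsizing target.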