This curriculum spans the technical and organizational practices found in multi-workshop capacity governance programs: demand forecasting, infrastructure modeling, cloud cost controls, performance baselining, and resilience planning across hybrid environments.
Module 1: Defining Capacity Requirements and Demand Forecasting
- Selecting time-series forecasting models (e.g., exponential smoothing vs. ARIMA) based on data availability and historical volatility in transaction volumes.
- Integrating business growth projections from finance teams into IT capacity models while accounting for mergers, product launches, or market expansions.
- Establishing thresholds for acceptable forecast error and defining recalibration cycles for predictive models.
- Mapping business service tiers to transaction volume expectations and peak concurrency requirements.
- Deciding between centralized and decentralized demand collection processes across business units.
- Handling seasonality and one-off events (e.g., end-of-quarter reporting spikes) in baseline capacity models.
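The forecasting and error-threshold topics above can be sketched in a few lines. This is a minimal illustration of simple exponential smoothing with a MAPE-based recalibration check, not a recommended model; the alpha value, demand series, and 10% error threshold are all hypothetical.

```python
def ses_forecast(series, alpha=0.3):
    """One-step-ahead simple exponential smoothing forecasts.

    Returns one forecast per observation; the final entry is the
    forecast for the next, unobserved period.
    """
    level = series[0]           # seed the level with the first observation
    forecasts = []
    for actual in series[1:]:
        forecasts.append(level)                      # forecast made before seeing `actual`
        level = alpha * actual + (1 - alpha) * level # update the smoothed level
    forecasts.append(level)                          # next-period forecast
    return forecasts

def mape(actuals, forecasts):
    """Mean absolute percentage error, usable as a recalibration trigger."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals) * 100

# Hypothetical monthly transaction volumes (thousands).
volumes = [100, 110, 105, 120, 130, 125, 140]
forecasts = ses_forecast(volumes, alpha=0.3)
error_pct = mape(volumes[1:], forecasts[:-1])  # compare forecasts to what actually happened
needs_recalibration = error_pct > 10.0         # illustrative error threshold
```

In practice a model like ARIMA would be fit with a library such as statsmodels; the point here is only the shape of the forecast-then-measure-error loop that drives recalibration cycles.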
Module 2: Infrastructure Sizing and Resource Modeling
- Calculating CPU, memory, and I/O requirements for virtualized workloads using performance benchmarks from pilot deployments.
- Choosing between overprovisioning and dynamic scaling strategies for cloud-hosted applications based on cost and performance SLAs.
- Modeling storage growth for structured and unstructured data with retention policies and compression ratios.
- Assessing the impact of container density on node-level resource contention in Kubernetes clusters.
- Validating sizing assumptions through load testing with production-like data sets and user behavior patterns.
- Adjusting resource allocation models to account for software bloat or inefficiencies in legacy applications.
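A sizing calculation like those above often reduces to "how many nodes fit the peak demand with headroom to spare." The sketch below assumes hypothetical node shapes and a 30% headroom target; real sizing would also account for I/O, container density limits, and failure-domain spares.

```python
import math

def nodes_needed(total_vcpus, total_mem_gib, node_vcpus, node_mem_gib, headroom=0.3):
    """Nodes required so peak demand fits within (1 - headroom) of capacity.

    Takes the max of the CPU-driven and memory-driven counts, since the
    tighter dimension determines the real node requirement.
    """
    usable_cpu = node_vcpus * (1 - headroom)
    usable_mem = node_mem_gib * (1 - headroom)
    return max(math.ceil(total_vcpus / usable_cpu),
               math.ceil(total_mem_gib / usable_mem))

# Hypothetical peak workload: 180 vCPUs and 720 GiB RAM,
# on nodes with 32 vCPUs and 128 GiB each.
count = nodes_needed(180, 720, 32, 128)
```

Keeping CPU and memory as separate constraints makes it obvious when a workload is memory-bound and a different node shape would waste fewer cores.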
Module 3: Cloud and Hybrid Capacity Management
- Designing auto-scaling policies that balance cost, latency, and instance warm-up time across AWS, Azure, or GCP.
- Allocating reserved instances versus on-demand instances based on predictable usage patterns and discount break-even analysis.
- Monitoring cross-region data transfer costs and egress fees when distributing capacity across availability zones.
- Implementing tagging and chargeback frameworks to attribute cloud spend to business units accurately.
- Managing cold start risks in serverless environments during sudden traffic surges.
- Enforcing capacity quotas and approval workflows to prevent uncontrolled resource provisioning in self-service cloud platforms.
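The reserved-versus-on-demand decision above is, at its core, a break-even calculation. The rates below are invented for illustration; real pricing varies by region, instance family, and commitment term.

```python
def breakeven_hours(on_demand_rate, reserved_annual_cost):
    """Hours per year above which a 1-year reservation beats on-demand pricing."""
    return reserved_annual_cost / on_demand_rate

def choose_pricing(expected_hours_per_year, on_demand_rate, reserved_annual_cost):
    """Pick the cheaper pricing model for a predicted annual usage level."""
    if expected_hours_per_year > breakeven_hours(on_demand_rate, reserved_annual_cost):
        return "reserved"
    return "on-demand"

# Hypothetical rates: $0.10/hr on-demand vs. $526/yr reserved
# -> break-even at 5,260 hours (about 60% annual utilization).
always_on = choose_pricing(8760, 0.10, 526)   # 24/7 workload
bursty    = choose_pricing(2000, 0.10, 526)   # intermittent workload
```

The same comparison extends to savings plans or 3-year terms by swapping in the appropriate committed cost and horizon.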
Module 4: Performance Monitoring and Baseline Establishment
- Selecting key performance indicators (KPIs) such as response time, throughput, and error rates for critical business transactions.
- Deploying synthetic transaction monitoring to detect performance degradation before user impact.
- Establishing performance baselines during normal operations to detect anomalies and plan for growth.
- Integrating monitoring tools (e.g., Prometheus, Datadog) with capacity planning databases for trend analysis.
- Filtering out noise from monitoring data caused by batch jobs, backups, or maintenance tasks.
- Defining alert thresholds that trigger capacity reviews without generating excessive false positives.
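The baseline-and-threshold ideas above can be shown with a simple z-score check: characterize normal operations, then flag values far outside that band. The sample latencies and the 3-sigma threshold are illustrative; production baselining would use percentiles over much longer, noise-filtered windows.

```python
import statistics

def baseline(samples):
    """Mean and standard deviation of a metric over a known-clean window."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value, mean, stdev, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from baseline."""
    return abs(value - mean) > z_threshold * stdev

# Hypothetical response times (ms) collected during normal operations,
# after filtering out batch-job and backup windows.
latencies = [200, 210, 195, 205, 198, 202]
mean, stdev = baseline(latencies)

spike_alert  = is_anomalous(400, mean, stdev)  # should trip a capacity review
normal_value = is_anomalous(205, mean, stdev)  # should not
```

Raising `z_threshold` trades sensitivity for fewer false positives, which is exactly the tuning decision the alert-threshold bullet describes.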
Module 5: Capacity Governance and Decision Frameworks
- Creating a capacity review board with representation from infrastructure, application, and business teams to prioritize investments.
- Documenting capacity decisions and assumptions in a centralized repository for audit and continuity purposes.
- Setting escalation paths for capacity shortfalls that threaten service level objectives (SLOs).
- Aligning capacity refresh cycles with hardware end-of-life and software support timelines.
- Enforcing standard templates for capacity requests to ensure consistent evaluation across departments.
- Balancing short-term tactical fixes (e.g., vertical scaling) against long-term architectural improvements.
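Enforcing a standard request template is straightforward to automate. The field names below are hypothetical placeholders for whatever a given review board actually requires; the pattern is simply "reject requests with missing fields before they reach evaluation."

```python
# Illustrative template fields; a real board would define its own.
REQUIRED_FIELDS = {
    "service",
    "business_unit",
    "requested_vcpus",
    "requested_mem_gib",
    "justification",
    "forecast_horizon_months",
}

def missing_fields(request):
    """Return the template fields absent from a capacity request dict."""
    return REQUIRED_FIELDS - request.keys()

incomplete = missing_fields({"service": "billing", "requested_vcpus": 16})
complete = missing_fields({
    "service": "billing", "business_unit": "finance",
    "requested_vcpus": 16, "requested_mem_gib": 64,
    "justification": "Q3 volume growth", "forecast_horizon_months": 12,
})
```

A gate like this in the request intake path is what makes evaluations consistent across departments: every request arrives with the same decision-relevant data.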
Module 6: Scalability Architecture and Design Patterns
- Choosing between vertical and horizontal scaling based on application statefulness and database constraints.
- Implementing read replicas and sharding strategies to distribute database load under high query volumes.
- Evaluating message queue backlogs as indicators of downstream processing bottlenecks.
- Designing stateless application layers to enable seamless horizontal scaling and failover.
- Assessing the impact of caching strategies (in-memory vs. distributed) on memory capacity and data consistency.
- Integrating circuit breakers and rate limiting to prevent cascading failures during capacity saturation.
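The circuit-breaker bullet can be made concrete with a minimal sketch: open after a run of consecutive failures, reject calls while open, and allow a trial call after a cooldown. The failure count and reset window are arbitrary; production systems would use a hardened library rather than this hand-rolled version.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, rejects calls while open, and half-opens after `reset_after`
    seconds to let a single trial call through."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # any success closes the breaker
        return result
```

Rejecting calls immediately while open is what prevents a saturated downstream dependency from dragging its callers into the same failure, which is the cascading-failure scenario the bullet above targets.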
Module 7: Risk Management and Contingency Planning
- Conducting stress tests to identify breaking points and define safe operating limits for production systems.
- Developing surge capacity plans for disaster recovery scenarios involving workload failover.
- Quantifying the risk of under-provisioning using historical incident data and outage cost estimates.
- Establishing pre-approved budget and procurement pathways for emergency capacity acquisition.
- Documenting fallback procedures when auto-scaling fails to keep pace with demand spikes.
- Reviewing third-party dependency capacity (e.g., APIs, CDNs) as part of end-to-end service resilience planning.
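Quantifying under-provisioning risk, as the bullet above describes, often comes down to an expected-loss comparison. The incident counts and dollar figures below are invented; real inputs would come from historical incident data and business-side outage cost estimates.

```python
def expected_annual_loss(incidents_per_year, avg_outage_hours, cost_per_hour):
    """Expected yearly cost of capacity-related outages, from historical data."""
    return incidents_per_year * avg_outage_hours * cost_per_hour

def surge_buffer_justified(buffer_annual_cost, loss_without, loss_with):
    """A surge-capacity buffer pays for itself if the loss it averts
    exceeds what the buffer costs to keep on standby."""
    return (loss_without - loss_with) > buffer_annual_cost

# Hypothetical scenario: 4 outages/yr averaging 2 hours at $50k/hr,
# reduced to an expected $100k/yr with a surge buffer costing $150k/yr.
loss_now = expected_annual_loss(4, 2, 50_000)
worth_it = surge_buffer_justified(150_000, loss_now, 100_000)
```

Framing the decision this way gives the capacity review board a defensible, auditable number rather than an intuition about how much headroom "feels" safe.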
Module 8: Continuous Improvement and Optimization
- Conducting post-incident reviews after capacity-related outages to update forecasting models and thresholds.
- Reclaiming underutilized resources through rightsizing initiatives and decommissioning idle instances.
- Integrating capacity feedback loops into CI/CD pipelines to assess performance impact of new releases.
- Using utilization heatmaps to identify opportunities for workload consolidation or migration.
- Updating capacity models quarterly based on actual usage trends and business trajectory changes.
- Benchmarking capacity efficiency metrics (e.g., cost per transaction, utilization rates) across peer systems.
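The rightsizing bullet above amounts to a filter over utilization data: find instances whose peak usage sits well below what they are paying for. The instance names, utilization figures, and thresholds here are all illustrative.

```python
def rightsizing_candidates(utilization, cpu_threshold=0.2, mem_threshold=0.3):
    """Instances whose peak CPU and memory both sit below the thresholds.

    `utilization` maps instance name -> (peak_cpu_fraction, peak_mem_fraction).
    """
    return [name for name, (cpu, mem) in utilization.items()
            if cpu < cpu_threshold and mem < mem_threshold]

# Hypothetical peak utilization over a 30-day window.
peaks = {
    "web-1":   (0.65, 0.70),  # well used, leave alone
    "batch-2": (0.10, 0.15),  # candidate for a smaller instance
    "idle-3":  (0.05, 0.08),  # candidate for decommissioning
}
candidates = rightsizing_candidates(peaks)
```

Using peak rather than average utilization is the conservative choice: an instance that averages 5% but spikes to 90% during end-of-quarter processing is not a safe rightsizing target.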