This curriculum spans the technical and organizational practices found in multi-workshop capacity planning programs. It covers metric definition, data pipeline design, forecasting, and governance workflows similar to those in enterprise SLM and cloud optimization initiatives.
Module 1: Defining Capacity Metrics Aligned with Business Services
- Select which transaction types to monitor for capacity analysis based on business criticality and transaction volume thresholds.
- Determine whether to use peak-hour or sustained-load metrics when defining service capacity baselines.
- Decide whether to include dependent backend systems (e.g., databases, APIs) in the service boundary for capacity reporting.
- Negotiate with business stakeholders on acceptable response time thresholds that trigger capacity alerts.
- Choose between business transaction counts, API calls, or user sessions as primary workload units for reporting.
- Implement tagging strategies to differentiate production, staging, and test traffic in capacity data aggregation.
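As one illustration of the peak-hour vs. sustained-load decision and the environment-tagging practice above, the comparison can be sketched in Python. The function name and the use of a 95th percentile as the "sustained" measure are assumptions for this example, not prescriptions of the curriculum:

```python
from statistics import quantiles

def capacity_baselines(hourly_counts, env_tags, env="production"):
    """Compare a peak-hour baseline with a sustained-load baseline
    for one service, keeping only traffic tagged for the target
    environment (per the tagging strategy for prod/staging/test)."""
    prod = [c for c, t in zip(hourly_counts, env_tags) if t == env]
    peak = max(prod)  # single busiest hour in the window
    # Sustained load: 95th percentile smooths one-off spikes
    # (index 94 of the 99 cut points returned for n=100).
    sustained = quantiles(prod, n=100)[94]
    return {"peak_hour": peak, "sustained_p95": sustained}
```

A service with one extreme hour will show a large gap between the two numbers, which is exactly the signal that the choice of baseline materially changes the reported capacity.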
Module 2: Instrumentation and Data Collection Architecture
- Configure APM agents to sample high-volume transactions without overwhelming data pipelines.
- Select which performance counters to collect from virtualized and containerized environments (e.g., CPU steal time, container memory limits).
- Integrate synthetic transaction monitoring with real user monitoring to validate capacity assumptions.
- Design log sampling rates to balance diagnostic fidelity with storage cost in high-throughput systems.
- Implement secure credential handling for monitoring tools accessing production databases and middleware.
- Establish data retention policies for raw performance telemetry based on compliance and troubleshooting needs.
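The transaction-sampling practice above is often implemented as deterministic head-based sampling, so every collector makes the same keep/drop decision for a given trace. A minimal sketch, assuming a string trace ID (the function name is illustrative):

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling for high-volume transactions: hash the
    trace id into a uniform bucket in [0, 1) and keep the trace only
    if the bucket falls below the configured sample rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the trace ID, independently deployed agents agree on which traces to keep, which preserves complete traces while capping pipeline volume.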
Module 3: Establishing Baselines and Thresholds
- Calculate seasonal baselines for capacity utilization using historical data across business cycles (e.g., month-end, holiday peaks).
- Determine whether to use static thresholds or dynamic baselines (e.g., machine learning-based anomaly detection) for alerting.
- Adjust baseline calculations to exclude known outage periods or maintenance windows.
- Define separate thresholds for warning and critical states based on mean time to repair (MTTR) and failover capabilities.
- Validate baseline accuracy by comparing forecasted vs. actual utilization during planned load events.
- Document exceptions for systems with non-recurring usage patterns (e.g., batch processing jobs).
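A seasonal baseline that excludes known outage and maintenance periods, as the bullets above require, can be sketched as a day-of-week average. The input shape and the exclusion-by-date granularity are assumptions for the example:

```python
from collections import defaultdict
from datetime import datetime

def weekly_baseline(samples, excluded_days):
    """samples: list of (iso_timestamp, utilization) pairs.
    excluded_days: set of ISO dates (outages, maintenance windows)
    dropped so they do not distort the baseline."""
    by_weekday = defaultdict(list)
    for ts, util in samples:
        dt = datetime.fromisoformat(ts)
        if dt.date().isoformat() in excluded_days:
            continue  # skip known-bad periods per the exclusion rule
        by_weekday[dt.strftime("%A")].append(util)
    return {day: sum(v) / len(v) for day, v in by_weekday.items()}
```

The same structure extends to month-end or holiday seasonality by bucketing on a different calendar key.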
Module 4: Forecasting Demand and Growth Trends
- Select forecasting models (e.g., linear regression, exponential smoothing) based on historical data stability and seasonality.
- Incorporate upcoming business initiatives (e.g., product launches, marketing campaigns) into demand projections.
- Adjust forecast inputs when development teams migrate workloads to new platforms or cloud regions.
- Quantify uncertainty ranges in forecasts and communicate them to infrastructure planning teams.
- Update forecast assumptions when observed growth deviates significantly from projections (e.g., >15% variance).
- Coordinate with finance to align capacity forecasts with capital expenditure cycles.
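Two of the practices above lend themselves to compact sketches: trend-aware exponential smoothing for the forecast itself, and the >15% variance check that triggers a re-fit. This is Holt's linear smoothing; the smoothing constants are illustrative defaults, not recommended values:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=3):
    """Holt's (double) exponential smoothing: tracks a level and a
    trend, then extrapolates the trend over the forecast horizon.
    Requires at least two observations to seed level and trend."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

def needs_refit(forecast, actual, tolerance=0.15):
    """Flag a forecast for re-fitting when observed demand deviates
    from the projection by more than the tolerance (15% per above)."""
    return abs(actual - forecast) / forecast > tolerance
```

On a perfectly linear series the forecast simply continues the line; real demand data will show residual error, which is where the uncertainty ranges mentioned above come in.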
Module 5: Reporting Structure and Stakeholder Communication
- Design executive dashboards to show capacity headroom as a percentage of maximum sustainable load.
- Segment reports by business unit or service owner to assign accountability for capacity actions.
- Include trend arrows and color coding to highlight services approaching or exceeding thresholds.
- Suppress non-actionable alerts in reports when capacity constraints are already addressed in the roadmap.
- Version control capacity reports to support audit requirements and track historical decisions.
- Automate report distribution to stakeholders while enforcing role-based access to sensitive data.
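The headroom-as-percentage and color-coding bullets above reduce to a small computation. The warning/critical cutoffs and the red/amber/green labels here are assumptions for the sketch; real thresholds would come from the MTTR-based definitions in Module 3:

```python
def headroom_report(services, warn=0.75, crit=0.90):
    """services: dict of name -> (current_load, max_sustainable_load).
    Returns dashboard rows with headroom as a percentage of maximum
    sustainable load and a red/amber/green status per service."""
    rows = []
    for name, (load, capacity) in sorted(services.items()):
        util = load / capacity
        status = "red" if util >= crit else "amber" if util >= warn else "green"
        rows.append({
            "service": name,
            "headroom_pct": round((1 - util) * 100, 1),
            "status": status,
        })
    return rows
```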
Module 6: Integration with Change and Incident Management
- Require capacity impact assessments for all changes involving high-load components or data volume increases.
- Correlate incident timelines with capacity spikes to determine if resource exhaustion contributed to outages.
- Flag changes that introduce new transaction types not covered in existing capacity monitoring.
- Update capacity models after major configuration changes (e.g., database sharding, load balancer rules).
- Link capacity reports to post-incident reviews to validate root cause hypotheses related to resource limits.
- Enforce pre-implementation capacity sign-off for projects expected to increase load by more than 20%.
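The 20% sign-off gate above can be expressed as a simple pre-implementation check attached to the change workflow. The function and its inputs are illustrative; in practice the projected load would come from the capacity impact assessment on the change record:

```python
def change_gate(current_load, projected_load, has_capacity_signoff):
    """Pre-implementation gate: block any change projected to raise
    load by more than 20% unless a capacity sign-off is attached."""
    increase = (projected_load - current_load) / current_load
    if increase > 0.20 and not has_capacity_signoff:
        return ("blocked", increase)
    return ("approved", increase)
```

Returning the computed increase alongside the decision makes the gate auditable in post-incident reviews.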
Module 7: Governance and Continuous Improvement
- Establish a capacity review board to evaluate high-risk services and approve remediation plans.
- Define SLA-backed capacity targets for critical services and track adherence monthly.
- Conduct quarterly audits of monitoring coverage to identify uninstrumented critical components.
- Retire outdated capacity models when application architecture changes invalidate assumptions.
- Measure and report on the accuracy of past forecasts to improve modeling practices.
- Enforce naming and tagging standards across monitoring tools to ensure report consistency.
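Measuring forecast accuracy, as the governance bullets above require, is commonly done with mean absolute percentage error (MAPE). A minimal sketch over historical (forecast, actual) pairs; the pair ordering is an assumption of this example:

```python
def forecast_mape(pairs):
    """Mean absolute percentage error over (forecast, actual) pairs,
    used to score past forecasts in a periodic governance review.
    Pairs with a zero actual are skipped to avoid division by zero."""
    errors = [abs(actual - fcst) / actual for fcst, actual in pairs if actual]
    return round(100 * sum(errors) / len(errors), 2)
```

Tracking this number release over release shows whether modeling practices are actually improving, not just whether individual forecasts happened to land.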
Module 8: Cloud and Hybrid Environment Considerations
- Differentiate between committed and burstable capacity in cloud environments when reporting headroom.
- Monitor reserved instance utilization to identify underused commitments and optimize costs.
- Account for network egress charges when modeling scalability of data-intensive services.
- Integrate cloud provider auto-scaling logs into capacity reports to assess scaling effectiveness.
- Report on cold-start latency impacts in serverless environments during traffic surges.
- Align capacity reporting intervals with cloud billing cycles for cost-capacity correlation analysis.
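The reserved-instance utilization bullet above amounts to splitting usage hours into covered, idle, and overflow portions. A minimal sketch under the assumption that commitment and usage are both expressed in instance-hours for the same billing period:

```python
def ri_utilization(reserved_hours, used_hours):
    """Share of a reserved-instance commitment actually consumed.
    Idle hours are paid-for but unused commitment (cost to reclaim);
    overflow hours run at on-demand rates beyond the commitment."""
    covered = min(used_hours, reserved_hours)
    return {
        "utilization_pct": round(100 * covered / reserved_hours, 1),
        "idle_hours": reserved_hours - covered,
        "on_demand_overflow": max(0, used_hours - reserved_hours),
    }
```

Reporting idle hours and on-demand overflow side by side distinguishes over-commitment (waste) from under-commitment (missed discount), which drive opposite optimization actions.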