This curriculum spans the technical and organizational practices found in multi-workshop capacity planning programs. It covers metric definition, data pipeline design, forecasting, and governance workflows similar to those in enterprise SLM and cloud optimization initiatives.
Module 1: Defining Capacity Metrics Aligned with Business Services
- Select which transaction types to monitor for capacity analysis based on business criticality and transaction volume thresholds.
- Determine whether to use peak-hour or sustained-load metrics when defining service capacity baselines.
- Decide whether to include dependent backend systems (e.g., databases, APIs) in the service boundary for capacity reporting.
- Negotiate with business stakeholders on acceptable response time thresholds that trigger capacity alerts.
- Choose between business transaction counts, API calls, or user sessions as primary workload units for reporting.
- Implement tagging strategies to differentiate production, staging, and test traffic in capacity data aggregation.
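As one illustration of the peak-hour vs. sustained-load decision and the environment-tagging practice above, the comparison can be sketched in Python. The function name and the use of a 95th percentile as the "sustained" measure are assumptions for this example, not prescriptions of the curriculum:

```python
from statistics import quantiles

def capacity_baselines(hourly_counts, env_tags, env="production"):
    """Compare a peak-hour baseline with a sustained-load baseline
    for one service, keeping only traffic tagged for the target
    environment (per the tagging strategy for prod/staging/test)."""
    prod = [c for c, t in zip(hourly_counts, env_tags) if t == env]
    peak = max(prod)  # single busiest hour in the window
    # Sustained load: 95th percentile smooths one-off spikes
    # (index 94 of the 99 cut points returned for n=100).
    sustained = quantiles(prod, n=100)[94]
    return {"peak_hour": peak, "sustained_p95": sustained}
```

A service with one extreme hour will show a large gap between the two numbers, which is exactly the signal that the choice of baseline materially changes the reported capacity.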
Module 2: Instrumentation and Data Collection Architecture
- Configure APM agents to sample high-volume transactions without overwhelming data pipelines.
- Select which performance counters to collect from virtualized and containerized environments (e.g., CPU steal time, container memory limits).
- Integrate synthetic transaction monitoring with real user monitoring to validate capacity assumptions.
- Design log sampling rates to balance diagnostic fidelity with storage cost in high-throughput systems.
- Implement secure credential handling for monitoring tools accessing production databases and middleware.
- Establish data retention policies for raw performance telemetry based on compliance and troubleshooting needs.
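The transaction-sampling practice above is often implemented as deterministic head-based sampling, so every collector makes the same keep/drop decision for a given trace. A minimal sketch, assuming a string trace ID (the function name is illustrative):

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling for high-volume transactions: hash the
    trace id into a uniform bucket in [0, 1) and keep the trace only
    if the bucket falls below the configured sample rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the trace ID, independently deployed agents agree on which traces to keep, which preserves complete traces while capping pipeline volume.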
Module 3: Establishing Baselines and Thresholds
- Calculate seasonal baselines for capacity utilization using historical data across business cycles (e.g., month-end, holiday peaks).
- Determine whether to use static thresholds or dynamic baselines (e.g., machine learning-based anomaly detection) for alerting.
- Adjust baseline calculations to exclude known outage periods or maintenance windows.
- Define separate thresholds for warning and critical states based on mean time to repair (MTTR) and failover capabilities.
- Validate baseline accuracy by comparing forecasted vs. actual utilization during planned load events.
- Document exceptions for systems with non-recurring usage patterns (e.g., batch processing jobs).
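A seasonal baseline that excludes known outage and maintenance periods, as the bullets above require, can be sketched as a day-of-week average. The input shape and the exclusion-by-date granularity are assumptions for the example:

```python
from collections import defaultdict
from datetime import datetime

def weekly_baseline(samples, excluded_days):
    """samples: list of (iso_timestamp, utilization) pairs.
    excluded_days: set of ISO dates (outages, maintenance windows)
    dropped so they do not distort the baseline."""
    by_weekday = defaultdict(list)
    for ts, util in samples:
        dt = datetime.fromisoformat(ts)
        if dt.date().isoformat() in excluded_days:
            continue  # skip known-bad periods per the exclusion rule
        by_weekday[dt.strftime("%A")].append(util)
    return {day: sum(v) / len(v) for day, v in by_weekday.items()}
```

The same structure extends to month-end or holiday seasonality by bucketing on a different calendar key.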
Module 4: Forecasting Demand and Growth Trends
- Select forecasting models (e.g., linear regression, exponential smoothing) based on historical data stability and seasonality.
- Incorporate upcoming business initiatives (e.g., product launches, marketing campaigns) into demand projections.
- Adjust forecast inputs when development teams migrate workloads to new platforms or cloud regions.
- Quantify uncertainty ranges in forecasts and communicate them to infrastructure planning teams.
- Update forecast assumptions when observed growth deviates significantly from projections (e.g., >15% variance).
- Coordinate with finance to align capacity forecasts with capital expenditure cycles.
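Two of the practices above lend themselves to compact sketches: trend-aware exponential smoothing for the forecast itself, and the >15% variance check that triggers a re-fit. This is Holt's linear smoothing; the smoothing constants are illustrative defaults, not recommended values:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=3):
    """Holt's (double) exponential smoothing: tracks a level and a
    trend, then extrapolates the trend over the forecast horizon.
    Requires at least two observations to seed level and trend."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

def needs_refit(forecast, actual, tolerance=0.15):
    """Flag a forecast for re-fitting when observed demand deviates
    from the projection by more than the tolerance (15% per above)."""
    return abs(actual - forecast) / forecast > tolerance
```

On a perfectly linear series the forecast simply continues the line; real demand data will show residual error, which is where the uncertainty ranges mentioned above come in.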
Module 5: Reporting Structure and Stakeholder Communication
- Design executive dashboards to show capacity headroom as a percentage of maximum sustainable load.
- Segment reports by business unit or service owner to assign accountability for capacity actions.
- Include trend arrows and color coding to highlight services approaching or exceeding thresholds.
- Suppress non-actionable alerts in reports when capacity constraints are already addressed in the roadmap.
- Version control capacity reports to support audit requirements and track historical decisions.
- Automate report distribution to stakeholders while enforcing role-based access to sensitive data.
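The headroom-as-percentage and color-coding bullets above reduce to a small computation. The warning/critical cutoffs and the red/amber/green labels here are assumptions for the sketch; real thresholds would come from the MTTR-based definitions in Module 3:

```python
def headroom_report(services, warn=0.75, crit=0.90):
    """services: dict of name -> (current_load, max_sustainable_load).
    Returns dashboard rows with headroom as a percentage of maximum
    sustainable load and a red/amber/green status per service."""
    rows = []
    for name, (load, capacity) in sorted(services.items()):
        util = load / capacity
        status = "red" if util >= crit else "amber" if util >= warn else "green"
        rows.append({
            "service": name,
            "headroom_pct": round((1 - util) * 100, 1),
            "status": status,
        })
    return rows
```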
Module 6: Integration with Change and Incident Management
- Require capacity impact assessments for all changes involving high-load components or data volume increases.
- Correlate incident timelines with capacity spikes to determine if resource exhaustion contributed to outages.
- Flag changes that introduce new transaction types not covered in existing capacity monitoring.
- Update capacity models after major configuration changes (e.g., database sharding, load balancer rules).
- Link capacity reports to post-incident reviews to validate root cause hypotheses related to resource limits.
- Enforce pre-implementation capacity sign-off for projects expected to increase load by more than 20%.
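The 20% sign-off gate above can be expressed as a simple pre-implementation check attached to the change workflow. The function and its inputs are illustrative; in practice the projected load would come from the capacity impact assessment on the change record:

```python
def change_gate(current_load, projected_load, has_capacity_signoff):
    """Pre-implementation gate: block any change projected to raise
    load by more than 20% unless a capacity sign-off is attached."""
    increase = (projected_load - current_load) / current_load
    if increase > 0.20 and not has_capacity_signoff:
        return ("blocked", increase)
    return ("approved", increase)
```

Returning the computed increase alongside the decision makes the gate auditable in post-incident reviews.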
Module 7: Governance and Continuous Improvement
- Establish a capacity review board to evaluate high-risk services and approve remediation plans.
- Define SLA-backed capacity targets for critical services and track adherence monthly.
- Conduct quarterly audits of monitoring coverage to identify uninstrumented critical components.
- Retire outdated capacity models when application architecture changes invalidate assumptions.
- Measure and report on the accuracy of past forecasts to improve modeling practices.
- Enforce naming and tagging standards across monitoring tools to ensure report consistency.
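Measuring forecast accuracy, as the governance bullets above require, is commonly done with mean absolute percentage error (MAPE). A minimal sketch over historical (forecast, actual) pairs; the pair ordering is an assumption of this example:

```python
def forecast_mape(pairs):
    """Mean absolute percentage error over (forecast, actual) pairs,
    used to score past forecasts in a periodic governance review.
    Pairs with a zero actual are skipped to avoid division by zero."""
    errors = [abs(actual - fcst) / actual for fcst, actual in pairs if actual]
    return round(100 * sum(errors) / len(errors), 2)
```

Tracking this number release over release shows whether modeling practices are actually improving, not just whether individual forecasts happened to land.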
Module 8: Cloud and Hybrid Environment Considerations
- Differentiate between committed and burstable capacity in cloud environments when reporting headroom.
- Monitor reserved instance utilization to identify underused commitments and optimize costs.
- Account for network egress charges when modeling scalability of data-intensive services.
- Integrate cloud provider auto-scaling logs into capacity reports to assess scaling effectiveness.
- Report on cold-start latency impacts in serverless environments during traffic surges.
- Align capacity reporting intervals with cloud billing cycles for cost-capacity correlation analysis.
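The reserved-instance utilization bullet above amounts to splitting usage hours into covered, idle, and overflow portions. A minimal sketch under the assumption that commitment and usage are both expressed in instance-hours for the same billing period:

```python
def ri_utilization(reserved_hours, used_hours):
    """Share of a reserved-instance commitment actually consumed.
    Idle hours are paid-for but unused commitment (cost to reclaim);
    overflow hours run at on-demand rates beyond the commitment."""
    covered = min(used_hours, reserved_hours)
    return {
        "utilization_pct": round(100 * covered / reserved_hours, 1),
        "idle_hours": reserved_hours - covered,
        "on_demand_overflow": max(0, used_hours - reserved_hours),
    }
```

Reporting idle hours and on-demand overflow side by side distinguishes over-commitment (waste) from under-commitment (missed discount), which drive opposite optimization actions.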