This curriculum spans the technical, operational, and organisational dimensions of capacity management. Comparable in scope to a multi-workshop program embedded in an enterprise's internal performance engineering practice, it addresses real-world challenges from instrumentation and forecasting to governance and cross-team alignment.
Module 1: Defining Capacity and Performance Metrics
- Selecting between peak vs. sustained capacity thresholds when sizing infrastructure for transactional systems.
- Deciding whether to track utilization at the hardware level (e.g., CPU %) or at the service level (e.g., requests per second).
- Aligning metric definitions across teams to ensure consistency in reporting between infrastructure, application, and business units.
- Determining the appropriate level of granularity for metrics—per-server, per-service, or per-tenant—in multi-tenant environments.
- Choosing between absolute values (e.g., 85% CPU) and derived indicators (e.g., CPU ready time) for virtualized environments.
- Establishing baseline performance profiles during normal operations to detect deviations in real-time monitoring.
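The baselining idea in the last bullet can be sketched in a few lines: record readings during normal operations, summarize them as a mean and standard deviation, and flag readings that fall outside a k-sigma band. The CPU values and the 3-sigma width below are illustrative assumptions, not prescribed settings:

```python
import statistics

def build_baseline(samples):
    """Summarize a window of normal-operation readings as (mean, stdev)."""
    return statistics.mean(samples), statistics.stdev(samples)

def deviates(value, mean, stdev, k=3.0):
    """Flag a reading more than k standard deviations from the baseline."""
    return abs(value - mean) > k * stdev

# Hypothetical CPU% readings captured during normal operations
normal = [41, 44, 39, 42, 45, 40, 43, 41, 44, 42]
mean, stdev = build_baseline(normal)
print(deviates(43, mean, stdev))  # False: within the baseline band
print(deviates(78, mean, stdev))  # True: a deviation worth alerting on
```

In practice the baseline would be recomputed on a rolling window (and often per hour-of-day) so the band tracks legitimate shifts in normal load.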
Module 2: Instrumentation and Data Collection
- Configuring agent-based vs. agentless monitoring based on security constraints and host OS diversity.
- Setting sampling intervals to balance data fidelity with storage and processing overhead in high-volume systems.
- Integrating custom application-level metrics into centralized telemetry platforms without introducing latency.
- Managing credential access and encryption for collectors pulling data from production databases and middleware.
- Filtering noisy metrics at the collection layer to reduce false alerts in downstream analysis.
- Validating time synchronization across distributed nodes to ensure accurate correlation of performance events.
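The sampling-interval trade-off above is ultimately arithmetic: shorter intervals multiply sample counts, and sample counts multiply storage and processing cost. A minimal sketch, using hypothetical fleet sizes and an assumed 16 bytes per stored sample:

```python
def daily_samples(interval_s):
    """Samples per metric per day at a given collection interval."""
    return 86_400 // interval_s

def storage_per_day_mb(hosts, metrics_per_host, interval_s, bytes_per_sample=16):
    """Rough daily storage footprint for a fleet, in MB."""
    samples = daily_samples(interval_s) * hosts * metrics_per_host
    return samples * bytes_per_sample / 1_048_576

# Hypothetical fleet: 500 hosts, 200 metrics each
for interval in (10, 60, 300):
    print(f"{interval:>4}s -> {storage_per_day_mb(500, 200, interval):,.0f} MB/day")
```

Running the numbers this way makes the fidelity-versus-overhead decision concrete before committing to a collection policy.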
Module 3: Thresholds, Alerts, and Anomaly Detection
- Setting dynamic thresholds using historical baselines instead of static percentages to reduce alert fatigue.
- Defining escalation paths for alerts based on business impact rather than technical severity alone.
- Suppressing alerts during scheduled maintenance windows without masking unintended outages.
- Configuring hysteresis in alert triggers to prevent flapping during transient load spikes.
- Evaluating the false positive rate of anomaly detection models before deploying them in production.
- Assigning ownership of alert response based on service ownership maps in hybrid operational models.
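The hysteresis bullet can be illustrated with a two-watermark trigger: the alert fires above a high threshold but only clears below a lower one, so transient dips during a sustained spike do not cause flapping. The thresholds and readings here are illustrative assumptions:

```python
class HysteresisAlert:
    """Fire at or above `high`; clear only at or below `low`."""
    def __init__(self, high, low):
        self.high, self.low = high, low
        self.firing = False

    def update(self, value):
        if not self.firing and value >= self.high:
            self.firing = True
        elif self.firing and value <= self.low:
            self.firing = False
        return self.firing

alert = HysteresisAlert(high=90, low=75)
readings = [70, 92, 88, 80, 91, 74, 70]
states = [alert.update(v) for v in readings]
print(states)  # [False, True, True, True, True, False, False]
```

Note that the dips to 88 and 80 do not clear the alert; a single-threshold trigger at 90 would have flapped twice over the same readings.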
Module 4: Capacity Modeling and Forecasting
- Selecting between linear, exponential, and logistic growth models based on historical usage trends and business trajectory.
- Incorporating seasonal demand patterns (e.g., fiscal year-end, holiday spikes) into long-term forecasts.
- Adjusting forecast models when major application changes or architectural refactors alter resource consumption profiles.
- Quantifying uncertainty ranges in forecasts to inform buffer capacity decisions and risk planning.
- Validating forecast accuracy by back-testing against past data and refining model parameters.
- Aligning forecast outputs with procurement lead times to ensure timely hardware or cloud resource acquisition.
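The simplest of the growth models above, a linear trend, can be fitted by least squares and projected forward; the same fit against a held-out tail of the history is the back-testing step. The monthly usage figures are hypothetical:

```python
def linear_fit(ys):
    """Least-squares (slope, intercept) for evenly spaced observations."""
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def forecast(ys, periods_ahead):
    """Project the fitted trend `periods_ahead` beyond the last observation."""
    slope, intercept = linear_fit(ys)
    return intercept + slope * (len(ys) - 1 + periods_ahead)

# Hypothetical monthly storage usage in TB
usage = [10.0, 10.8, 11.5, 12.4, 13.1, 13.9]
print(round(forecast(usage, 6), 1))  # projected usage six months out
```

Exponential and logistic models follow the same pattern with a transformed fit; whichever model is chosen, the forecast should carry an uncertainty range, not just the point estimate.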
Module 5: Resource Allocation and Right-Sizing
- Right-sizing virtual machines based on actual utilization, considering both CPU and memory pressure.
- Deciding between vertical and horizontal scaling strategies in containerized environments.
- Implementing automated scaling policies while preventing thrashing due to rapid load fluctuations.
- Allocating shared resources (e.g., database connections, thread pools) to prevent contention across services.
- Enforcing resource quotas in multi-tenant platforms to prevent noisy neighbor effects.
- Rebalancing workloads across clusters during hardware refresh cycles or data center migrations.
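The anti-thrashing concern in the scaling-policy bullet is commonly handled with a cooldown: after any scale action, further actions are suppressed for a fixed number of evaluation ticks. A minimal sketch with assumed thresholds and replica bounds:

```python
class Autoscaler:
    """Scale on CPU%, with a cooldown between actions to avoid thrashing."""
    def __init__(self, min_n=2, max_n=10, up_at=80.0, down_at=30.0, cooldown=3):
        self.min_n, self.max_n = min_n, max_n
        self.up_at, self.down_at = up_at, down_at
        self.cooldown = cooldown
        self.replicas = min_n
        self.last_change = -cooldown  # permit an immediate first action

    def decide(self, cpu_pct, tick):
        if tick - self.last_change < self.cooldown:
            return self.replicas  # still cooling down; hold steady
        if cpu_pct >= self.up_at and self.replicas < self.max_n:
            self.replicas += 1
            self.last_change = tick
        elif cpu_pct <= self.down_at and self.replicas > self.min_n:
            self.replicas -= 1
            self.last_change = tick
        return self.replicas

scaler = Autoscaler()
load = [85, 90, 88, 85, 25, 20, 22, 18]
print([scaler.decide(cpu, t) for t, cpu in enumerate(load)])
# [3, 3, 3, 4, 4, 4, 3, 3]
```

Without the cooldown, the same trace would add a replica on every hot tick and shed one on every cool tick; production autoscalers typically add separate stabilization windows for scale-up and scale-down.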
Module 6: Cost-Performance Trade-Offs
- Choosing between on-demand and reserved cloud instances based on forecasted utilization and budget constraints.
- Evaluating the cost of over-provisioning against the risk of performance degradation during unexpected demand.
- Assessing the total cost of ownership (TCO) for on-premises hardware, including power, cooling, and floor space.
- Justifying investment in performance optimization versus simply scaling infrastructure to meet demand.
- Implementing auto-remediation for underutilized resources to reduce cloud spend without impacting SLAs.
- Negotiating service-level agreements that reflect realistic capacity constraints and cost implications.
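The on-demand versus reserved decision in the first bullet reduces to a break-even utilization: the fraction of the month an instance must actually run before the reserved commitment is cheaper. The rates below are hypothetical, and 730 is used as an average month in hours:

```python
def break_even_utilization(on_demand_hourly, reserved_monthly, hours_per_month=730):
    """Fraction of the month an instance must run before reserving wins."""
    return reserved_monthly / (on_demand_hourly * hours_per_month)

# Hypothetical rates: $0.10/hr on demand vs. $45/month reserved
util = break_even_utilization(0.10, 45.0)
print(f"{util:.0%}")  # reserve when forecast utilization exceeds this
```

Comparing this break-even point against the forecast from Module 4 turns the purchasing decision into a direct readout rather than a judgment call.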
Module 7: Governance and Compliance in Capacity Planning
- Documenting capacity decisions to support audit requirements for regulated workloads (e.g., HIPAA, PCI).
- Establishing change control processes for capacity-related infrastructure modifications.
- Defining retention policies for performance data based on legal, operational, and storage considerations.
- Ensuring capacity planning aligns with disaster recovery and business continuity requirements.
- Reporting capacity utilization to executive stakeholders using standardized, non-technical dashboards.
- Conducting periodic capacity reviews with application owners to validate assumptions and update forecasts.
Module 8: Cross-Functional Integration and Continuous Improvement
- Integrating capacity metrics into incident post-mortems to identify resource-related root causes.
- Collaborating with development teams to influence code efficiency and reduce per-request resource consumption.
- Feeding capacity data into CI/CD pipelines to detect performance regressions before deployment.
- Standardizing metric schemas across teams to enable centralized capacity analytics and reporting.
- Conducting blameless capacity drills to test response readiness for resource exhaustion scenarios.
- Updating capacity models quarterly based on actual usage, business changes, and technology refreshes.
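The CI/CD bullet above amounts to a gate: compare a candidate build's load-test result against the recorded baseline and fail the pipeline when the regression exceeds a tolerance budget. A minimal sketch, with the 10% budget and p95 latencies as illustrative assumptions:

```python
def regression_gate(baseline_ms, candidate_ms, tolerance=0.10):
    """Return True (pass) unless the candidate regresses past the budget."""
    limit = baseline_ms * (1 + tolerance)
    return candidate_ms <= limit

# Hypothetical p95 latencies from a pre-deployment load test
print(regression_gate(120.0, 128.0))  # True: within the 10% budget
print(regression_gate(120.0, 140.0))  # False: ~17% regression, block deploy
```

The same check generalizes to CPU seconds or memory per request; the point is that the capacity signal blocks the deployment before the regression reaches production.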