This curriculum spans the technical, operational, and organisational dimensions of capacity management. Comparable in scope to a multi-workshop program embedded in an enterprise's internal performance engineering practice, it addresses real-world challenges from instrumentation and forecasting to governance and cross-team alignment.
Module 1: Defining Capacity and Performance Metrics
- Selecting between peak vs. sustained capacity thresholds when sizing infrastructure for transactional systems.
- Deciding whether to track utilization at the hardware level (e.g., CPU %) or at the service level (e.g., requests per second).
- Aligning metric definitions across teams to ensure consistency in reporting between infrastructure, application, and business units.
- Determining the appropriate level of granularity for metrics—per-server, per-service, or per-tenant—in multi-tenant environments.
- Choosing between absolute values (e.g., 85% CPU) and derived indicators (e.g., CPU ready time) for virtualized environments.
- Establishing baseline performance profiles during normal operations to detect deviations in real-time monitoring.
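The baselining idea in the last bullet can be sketched in a few lines: record readings during normal operations, summarize them as a mean and standard deviation, and flag readings that fall outside a k-sigma band. The CPU values and the 3-sigma width below are illustrative assumptions, not prescribed settings:

```python
import statistics

def build_baseline(samples):
    """Summarize a window of normal-operation readings as (mean, stdev)."""
    return statistics.mean(samples), statistics.stdev(samples)

def deviates(value, mean, stdev, k=3.0):
    """Flag a reading more than k standard deviations from the baseline."""
    return abs(value - mean) > k * stdev

# Hypothetical CPU% readings captured during normal operations
normal = [41, 44, 39, 42, 45, 40, 43, 41, 44, 42]
mean, stdev = build_baseline(normal)
print(deviates(43, mean, stdev))  # False: within the baseline band
print(deviates(78, mean, stdev))  # True: a deviation worth alerting on
```

In practice the baseline would be recomputed on a rolling window (and often per hour-of-day) so the band tracks legitimate shifts in normal load.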
Module 2: Instrumentation and Data Collection
- Configuring agent-based vs. agentless monitoring based on security constraints and host OS diversity.
- Setting sampling intervals to balance data fidelity with storage and processing overhead in high-volume systems.
- Integrating custom application-level metrics into centralized telemetry platforms without introducing latency.
- Managing credential access and encryption for collectors pulling data from production databases and middleware.
- Filtering noisy metrics at the collection layer to reduce false alerts in downstream analysis.
- Validating time synchronization across distributed nodes to ensure accurate correlation of performance events.
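The sampling-interval trade-off above is ultimately arithmetic: shorter intervals multiply sample counts, and sample counts multiply storage and processing cost. A minimal sketch, using hypothetical fleet sizes and an assumed 16 bytes per stored sample:

```python
def daily_samples(interval_s):
    """Samples per metric per day at a given collection interval."""
    return 86_400 // interval_s

def storage_per_day_mb(hosts, metrics_per_host, interval_s, bytes_per_sample=16):
    """Rough daily storage footprint for a fleet, in MB."""
    samples = daily_samples(interval_s) * hosts * metrics_per_host
    return samples * bytes_per_sample / 1_048_576

# Hypothetical fleet: 500 hosts, 200 metrics each
for interval in (10, 60, 300):
    print(f"{interval:>4}s -> {storage_per_day_mb(500, 200, interval):,.0f} MB/day")
```

Running the numbers this way makes the fidelity-versus-overhead decision concrete before committing to a collection policy.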
Module 3: Thresholds, Alerts, and Anomaly Detection
- Setting dynamic thresholds using historical baselines instead of static percentages to reduce alert fatigue.
- Defining escalation paths for alerts based on business impact rather than technical severity alone.
- Suppressing alerts during scheduled maintenance windows without masking unintended outages.
- Configuring hysteresis in alert triggers to prevent flapping during transient load spikes.
- Evaluating the false positive rate of anomaly detection models before deploying them in production.
- Assigning ownership of alert response based on service ownership maps in hybrid operational models.
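The hysteresis bullet can be illustrated with a two-watermark trigger: the alert fires above a high threshold but only clears below a lower one, so transient dips during a sustained spike do not cause flapping. The thresholds and readings here are illustrative assumptions:

```python
class HysteresisAlert:
    """Fire at or above `high`; clear only at or below `low`."""
    def __init__(self, high, low):
        self.high, self.low = high, low
        self.firing = False

    def update(self, value):
        if not self.firing and value >= self.high:
            self.firing = True
        elif self.firing and value <= self.low:
            self.firing = False
        return self.firing

alert = HysteresisAlert(high=90, low=75)
readings = [70, 92, 88, 80, 91, 74, 70]
states = [alert.update(v) for v in readings]
print(states)  # [False, True, True, True, True, False, False]
```

Note that the dips to 88 and 80 do not clear the alert; a single-threshold trigger at 90 would have flapped twice over the same readings.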
Module 4: Capacity Modeling and Forecasting
- Selecting between linear, exponential, and logistic growth models based on historical usage trends and business trajectory.
- Incorporating seasonal demand patterns (e.g., fiscal year-end, holiday spikes) into long-term forecasts.
- Adjusting forecast models when major application changes or architectural refactors alter resource consumption profiles.
- Quantifying uncertainty ranges in forecasts to inform buffer capacity decisions and risk planning.
- Validating forecast accuracy by back-testing against past data and refining model parameters.
- Aligning forecast outputs with procurement lead times to ensure timely hardware or cloud resource acquisition.
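The simplest of the growth models above, a linear trend, can be fitted by least squares and projected forward; the same fit against a held-out tail of the history is the back-testing step. The monthly usage figures are hypothetical:

```python
def linear_fit(ys):
    """Least-squares (slope, intercept) for evenly spaced observations."""
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def forecast(ys, periods_ahead):
    """Project the fitted trend `periods_ahead` beyond the last observation."""
    slope, intercept = linear_fit(ys)
    return intercept + slope * (len(ys) - 1 + periods_ahead)

# Hypothetical monthly storage usage in TB
usage = [10.0, 10.8, 11.5, 12.4, 13.1, 13.9]
print(round(forecast(usage, 6), 1))  # projected usage six months out
```

Exponential and logistic models follow the same pattern with a transformed fit; whichever model is chosen, the forecast should carry an uncertainty range, not just the point estimate.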
Module 5: Resource Allocation and Right-Sizing
- Right-sizing virtual machines based on actual utilization, considering both CPU and memory pressure.
- Deciding between vertical and horizontal scaling strategies in containerized environments.
- Implementing automated scaling policies while preventing thrashing due to rapid load fluctuations.
- Allocating shared resources (e.g., database connections, thread pools) to prevent contention across services.
- Enforcing resource quotas in multi-tenant platforms to prevent noisy neighbor effects.
- Rebalancing workloads across clusters during hardware refresh cycles or data center migrations.
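The anti-thrashing concern in the scaling-policy bullet is commonly handled with a cooldown: after any scale action, further actions are suppressed for a fixed number of evaluation ticks. A minimal sketch with assumed thresholds and replica bounds:

```python
class Autoscaler:
    """Scale on CPU%, with a cooldown between actions to avoid thrashing."""
    def __init__(self, min_n=2, max_n=10, up_at=80.0, down_at=30.0, cooldown=3):
        self.min_n, self.max_n = min_n, max_n
        self.up_at, self.down_at = up_at, down_at
        self.cooldown = cooldown
        self.replicas = min_n
        self.last_change = -cooldown  # permit an immediate first action

    def decide(self, cpu_pct, tick):
        if tick - self.last_change < self.cooldown:
            return self.replicas  # still cooling down; hold steady
        if cpu_pct >= self.up_at and self.replicas < self.max_n:
            self.replicas += 1
            self.last_change = tick
        elif cpu_pct <= self.down_at and self.replicas > self.min_n:
            self.replicas -= 1
            self.last_change = tick
        return self.replicas

scaler = Autoscaler()
load = [85, 90, 88, 85, 25, 20, 22, 18]
print([scaler.decide(cpu, t) for t, cpu in enumerate(load)])
# [3, 3, 3, 4, 4, 4, 3, 3]
```

Without the cooldown, the same trace would add a replica on every hot tick and shed one on every cool tick; production autoscalers typically add separate stabilization windows for scale-up and scale-down.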
Module 6: Cost-Performance Trade-Offs
- Choosing between on-demand and reserved cloud instances based on forecasted utilization and budget constraints.
- Evaluating the cost of over-provisioning against the risk of performance degradation during unexpected demand.
- Assessing the total cost of ownership (TCO) for on-premises hardware, including power, cooling, and floor space.
- Justifying investment in performance optimization versus simply scaling infrastructure to meet demand.
- Implementing auto-remediation for underutilized resources to reduce cloud spend without impacting SLAs.
- Negotiating service-level agreements that reflect realistic capacity constraints and cost implications.
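The on-demand versus reserved decision in the first bullet reduces to a break-even utilization: the fraction of the month an instance must actually run before the reserved commitment is cheaper. The rates below are hypothetical, and 730 is used as an average month in hours:

```python
def break_even_utilization(on_demand_hourly, reserved_monthly, hours_per_month=730):
    """Fraction of the month an instance must run before reserving wins."""
    return reserved_monthly / (on_demand_hourly * hours_per_month)

# Hypothetical rates: $0.10/hr on demand vs. $45/month reserved
util = break_even_utilization(0.10, 45.0)
print(f"{util:.0%}")  # reserve when forecast utilization exceeds this
```

Comparing this break-even point against the forecast from Module 4 turns the purchasing decision into a direct readout rather than a judgment call.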
Module 7: Governance and Compliance in Capacity Planning
- Documenting capacity decisions to support audit requirements for regulated workloads (e.g., HIPAA, PCI).
- Establishing change control processes for capacity-related infrastructure modifications.
- Defining retention policies for performance data based on legal, operational, and storage considerations.
- Ensuring capacity planning aligns with disaster recovery and business continuity requirements.
- Reporting capacity utilization to executive stakeholders using standardized, non-technical dashboards.
- Conducting periodic capacity reviews with application owners to validate assumptions and update forecasts.
Module 8: Cross-Functional Integration and Continuous Improvement
- Integrating capacity metrics into incident post-mortems to identify resource-related root causes.
- Collaborating with development teams to influence code efficiency and reduce per-request resource consumption.
- Feeding capacity data into CI/CD pipelines to detect performance regressions before deployment.
- Standardizing metric schemas across teams to enable centralized capacity analytics and reporting.
- Conducting blameless capacity drills to test response readiness for resource exhaustion scenarios.
- Updating capacity models quarterly based on actual usage, business changes, and technology refreshes.
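The CI/CD bullet above amounts to a gate: compare a candidate build's load-test result against the recorded baseline and fail the pipeline when the regression exceeds a tolerance budget. A minimal sketch, with the 10% budget and p95 latencies as illustrative assumptions:

```python
def regression_gate(baseline_ms, candidate_ms, tolerance=0.10):
    """Return True (pass) unless the candidate regresses past the budget."""
    limit = baseline_ms * (1 + tolerance)
    return candidate_ms <= limit

# Hypothetical p95 latencies from a pre-deployment load test
print(regression_gate(120.0, 128.0))  # True: within the 10% budget
print(regression_gate(120.0, 140.0))  # False: ~17% regression, block deploy
```

The same check generalizes to CPU seconds or memory per request; the point is that the capacity signal blocks the deployment before the regression reaches production.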