This curriculum carries the technical and operational rigor of a multi-workshop infrastructure planning engagement, addressing the same capacity modeling, hardware integration, and cross-system coordination challenges that arise when deploying AI workloads at scale in production data centers.
Module 1: Defining Capacity Requirements for AI Workloads
- Size GPU and TPU clusters based on model training duration targets and batch processing SLAs under variable load conditions.
- Quantify memory bandwidth and interconnect requirements for distributed training jobs to avoid bottlenecks in all-reduce operations.
- Allocate persistent storage capacity for model checkpoints, logs, and dataset versions considering retention policies and recovery needs.
- Estimate power draw per rack unit during peak inference cycles to align with existing PDU and UPS headroom.
- Model capacity elasticity needs for burstable inference workloads against reserved versus on-demand hardware provisioning.
- Balance model precision (FP32 vs FP16 vs INT8) with hardware utilization efficiency and accuracy degradation thresholds.
- Project storage growth from data pipeline outputs, including augmented and preprocessed datasets, over a 12-month horizon.
- Map AI pipeline stages (data ingestion, training, validation, serving) to distinct capacity zones with differentiated QoS policies.
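The cluster-sizing exercise in the first bullet can be reduced to a back-of-the-envelope calculation: total training compute divided by sustained per-GPU throughput and the deadline. A minimal sketch follows; the throughput, utilization, and deadline figures are hypothetical assumptions, not vendor specifications:

```python
import math

def gpus_needed(total_flops, peak_flops_per_gpu, mfu, deadline_hours):
    """Estimate the GPU count required to finish a training run by a deadline.

    total_flops        -- total floating-point operations for the full run
    peak_flops_per_gpu -- peak FLOP/s of one GPU (hypothetical spec)
    mfu                -- model FLOPs utilization, the fraction of peak
                          actually sustained (typically well below 1.0)
    deadline_hours     -- wall-clock budget for the run
    """
    sustained = peak_flops_per_gpu * mfu       # realistic FLOP/s per GPU
    seconds = deadline_hours * 3600
    return math.ceil(total_flops / (sustained * seconds))

# Example with made-up numbers: a 1e23-FLOP run, 1 PFLOP/s peak per GPU,
# 40% utilization, 30-day (720 h) deadline.
print(gpus_needed(1e23, 1e15, 0.4, 720))
```

The same arithmetic, inverted, answers the dual question in capacity reviews: given a fixed cluster, how long will a planned run occupy it.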
Module 2: Infrastructure Provisioning and Hardware Selection
- Select server SKUs based on PCIe lane availability, GPU-to-GPU NVLink topology, and thermal envelope for dense configurations.
- Compare throughput per watt across GPU generations when procuring hardware under constrained power budgets.
- Integrate liquid-cooled GPU racks into existing air-cooled data halls, including retrofitting manifold and chiller interfaces.
- Validate firmware compatibility across GPU drivers, BMCs, and host OS versions before large-scale deployment.
- Procure spare GPUs and memory modules at 5–10% over baseline to cover field failure replacements without service disruption.
- Implement firmware signing and secure boot across AI nodes to meet organizational hardware trust policies.
- Coordinate with network teams to provision 200/400 GbE or InfiniBand spines for low-latency RDMA in training clusters.
- Enforce hardware lifecycle policies that align GPU depreciation schedules with model retraining cadence.
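The throughput-per-watt comparison under a constrained power budget can be sketched as a ranking plus a budget division. The SKU names and figures below are invented for illustration, not real product specifications:

```python
from dataclasses import dataclass

@dataclass
class GpuSku:
    name: str
    sustained_tflops: float  # measured sustained throughput (hypothetical)
    watts: float             # board power under load (hypothetical)

def rank_by_efficiency(skus):
    """Order SKUs by throughput per watt, best first."""
    return sorted(skus, key=lambda s: s.sustained_tflops / s.watts, reverse=True)

def max_units_under_budget(sku, budget_watts):
    """How many of one SKU fit in a fixed power budget."""
    return int(budget_watts // sku.watts)

skus = [
    GpuSku("gen-a", sustained_tflops=60.0, watts=300.0),
    GpuSku("gen-b", sustained_tflops=90.0, watts=400.0),
]
best = rank_by_efficiency(skus)[0]
print(best.name, max_units_under_budget(best, 2000.0))
```

In practice the ranking would use measured sustained throughput on representative workloads, not datasheet peaks, since utilization differences across generations often dominate the nominal efficiency gap.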
Module 3: Power and Thermal Capacity Planning
- Calculate heat density per rack (kW/rack) for GPU nodes under sustained load and validate against CRAC unit capacity.
- Model seasonal PUE variance due to ambient temperature changes and adjust capacity forecasts accordingly.
- Implement dynamic power capping at the rack level to prevent circuit breaker trips during cluster-wide job starts.
- Deploy in-rack thermal sensors with sub-minute polling to detect hot spots before triggering cooling overrides.
- Coordinate AI cluster deployment phasing with electrical substation upgrade timelines to avoid over-subscription.
- Allocate headroom for AI workloads within the overall data center power envelope, factoring in non-AI systems.
- Use DCIM tools to simulate power failover scenarios for AI training jobs during UPS or generator testing.
- Negotiate with facilities to reserve chilled water loop capacity for future AI expansion zones.
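The first bullet's heat-density check is a direct calculation: nodes per rack times node power, with an overhead factor for fans, NICs, and power-conversion losses, compared against cooling capacity. A minimal sketch with hypothetical figures:

```python
def rack_heat_kw(nodes_per_rack, watts_per_node, overhead=1.1):
    """Estimated heat load per rack in kW under sustained load.

    overhead -- multiplier covering fans, NICs, and conversion losses
                (1.1 is an assumed placeholder, not a measured value).
    """
    return nodes_per_rack * watts_per_node * overhead / 1000.0

def fits_cooling(rack_kw, crac_capacity_kw_per_rack):
    """True if the rack's heat load is within the CRAC allocation."""
    return rack_kw <= crac_capacity_kw_per_rack

# Example: 8 GPU nodes at 6.5 kW each against a 60 kW/rack cooling budget.
load = rack_heat_kw(8, 6500.0)
print(load, fits_cooling(load, 60.0))
```

The same per-rack figure feeds the dynamic power-capping bullet: the rack cap should sit below the breaker rating with margin for the synchronized ramp that occurs when a cluster-wide job starts.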
Module 4: Storage Architecture and Data Pipeline Scaling
- Design parallel file systems (e.g., Lustre, WekaIO) with sufficient metadata server capacity to handle millions of small checkpoint files.
- Implement tiered storage policies that migrate cold training artifacts from NVMe to object storage based on access patterns.
- Size local SSD cache on GPU nodes to reduce latency for frequently accessed dataset shards during multi-epoch training.
- Provision high-throughput data pipelines from object storage to training clusters using parallel download agents.
- Enforce data replication factors across availability zones for training datasets to eliminate single points of failure.
- Monitor storage IOPS and bandwidth utilization during peak data loading and adjust buffer pool sizes accordingly.
- Integrate data versioning systems (e.g., DVC) with storage quotas to prevent uncontrolled dataset sprawl.
- Pre-stage datasets in regional data centers to minimize cross-site bandwidth consumption during distributed training.
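The tiered-storage bullet amounts to a policy function over access recency. A minimal sketch; the tier names and day thresholds are placeholder assumptions a real policy would tune from observed access patterns:

```python
def storage_tier(days_since_access, hot_days=7, warm_days=30):
    """Pick a storage tier for a training artifact from its access recency.

    Thresholds (7 and 30 days) are hypothetical defaults.
    """
    if days_since_access <= hot_days:
        return "local-nvme"    # actively read: keep on fast local flash
    if days_since_access <= warm_days:
        return "parallel-fs"   # occasionally read: shared parallel file system
    return "object-store"      # cold: migrate to cheap object storage

def migration_plan(artifacts, today):
    """Map artifact name -> target tier, given last-access day per artifact."""
    return {name: storage_tier(today - last_access)
            for name, last_access in artifacts.items()}

plan = migration_plan({"ckpt-epoch-12": 98, "raw-shard-04": 40}, today=100)
print(plan)
```

A production version would act on the plan asynchronously and keep metadata (paths, checksums) in the data-versioning system so migrated artifacts remain resolvable.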
Module 5: Network Capacity and Latency Optimization
- Design non-blocking network topologies for GPU clusters to maintain full bisection bandwidth during all-reduce operations.
- Configure QoS policies to prioritize RDMA traffic over best-effort management and monitoring traffic.
- Measure end-to-end latency between parameter servers and workers and adjust network buffer settings to minimize jitter.
- Implement network segmentation to isolate AI training, inference, and management traffic for performance and security.
- Size network buffers and queue depths to handle bursty data transfer patterns from data preprocessing pipelines.
- Validate MTU alignment across switches, NICs, and hosts to prevent packet fragmentation in high-throughput jobs.
- Monitor for microbursts using packet capture tools and adjust job scheduling concurrency to smooth traffic.
- Plan for future upgrade paths to next-gen interconnects (e.g., 800GbE, next-gen InfiniBand) in network fabric design.
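The bandwidth requirement behind the non-blocking-topology bullet can be estimated from the standard ring all-reduce communication volume, in which each of N ranks sends and receives roughly 2(N−1)/N times the payload size. A minimal sketch with hypothetical link speeds:

```python
def ring_allreduce_seconds(payload_bytes, n_ranks, link_bytes_per_s):
    """Lower-bound time for one ring all-reduce, bandwidth-only model.

    Ignores latency per step and protocol overhead, so real times are
    higher; useful for sizing, not for benchmarking.
    """
    if n_ranks < 2:
        return 0.0
    volume_factor = 2 * (n_ranks - 1) / n_ranks   # classic ring cost
    return volume_factor * payload_bytes / link_bytes_per_s

# Example: 1 GB of gradients, 4 ranks, 200 Gb/s links (~25 GB/s).
print(ring_allreduce_seconds(1e9, 4, 25e9))
```

Comparing this figure against the per-step compute time indicates whether communication can be hidden behind computation or whether the fabric, not the GPUs, sets the training throughput.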
Module 6: Capacity Governance and Resource Allocation
- Implement role-based quota systems for GPU, CPU, and memory allocation across research, production, and staging teams.
- Enforce job time limits and auto-termination policies for interactive Jupyter environments to prevent resource hoarding.
- Track per-project resource consumption for chargeback or showback reporting using monitoring telemetry.
- Establish approval workflows for over-quota requests tied to business justification and model ROI estimates.
- Define priority classes for jobs (e.g., production inference > retraining > experimentation) in the scheduler.
- Integrate capacity requests into CI/CD pipelines to validate resource availability before model deployment.
- Conduct monthly capacity reviews with stakeholders to reconcile forecasted versus actual usage.
- Implement soft eviction policies for preemptible training jobs to maximize hardware utilization without violating SLAs.
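The quota-system bullet can be sketched as a small admission-control class: requests are admitted only while the team stays within its GPU limit. Team names and limits below are illustrative:

```python
class QuotaManager:
    """Track per-team GPU usage against fixed limits (a simplified sketch;
    a real system would also handle CPU, memory, and priority classes)."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {team: 0 for team in limits}

    def request(self, team, gpus):
        """Admit the request if it fits in the team's quota."""
        if self.used[team] + gpus > self.limits[team]:
            return False           # over quota: route to approval workflow
        self.used[team] += gpus
        return True

    def release(self, team, gpus):
        """Return GPUs to the pool when a job finishes."""
        self.used[team] = max(0, self.used[team] - gpus)

qm = QuotaManager({"research": 8, "production": 16})
print(qm.request("research", 6), qm.request("research", 4))
```

Rejected requests would feed the over-quota approval workflow described above rather than failing silently, so business justification stays attached to every exception.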
Module 7: Monitoring, Alerting, and Capacity Forecasting
- Deploy GPU telemetry agents to collect utilization, memory, and temperature metrics at 10-second intervals.
- Set dynamic thresholds for capacity alerts based on historical usage patterns and seasonal trends.
- Correlate job scheduling logs with power draw data to identify inefficient job packing or scheduling gaps.
- Forecast 6-month capacity needs using time-series models trained on job submission, runtime, and resource profiles.
- Integrate capacity forecasts with procurement lead times to trigger hardware orders at optimal intervals.
- Visualize capacity utilization heatmaps across racks, zones, and clusters to identify underused hardware.
- Monitor storage growth rates per team and enforce alerts when growth exceeds allocated projections.
- Use anomaly detection to flag unexpected capacity consumption from misconfigured or runaway AI jobs.
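As a simplest-possible instance of the forecasting bullet, a least-squares linear trend over historical utilization can be extrapolated over the planning horizon. Real forecasts would use proper time-series models with seasonality; this sketch only illustrates the shape of the calculation:

```python
def linear_forecast(history, horizon):
    """Fit y = a + b*x by least squares over history and extrapolate.

    history -- list of utilization samples, one per period
    horizon -- number of future periods to forecast
    """
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + i) for i in range(horizon)]

# Example: steadily growing monthly GPU-hours (made-up series).
print(linear_forecast([1.0, 2.0, 3.0, 4.0], 2))
```

Feeding such forecasts into procurement requires subtracting vendor lead time from the predicted exhaustion date, which is what turns a trend line into an order trigger.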
Module 8: Disaster Recovery and Capacity Resilience
- Replicate critical model checkpoints and datasets to secondary data centers with defined RPO and RTO targets.
- Size standby GPU clusters at disaster recovery sites based on priority workload rankings and failover scope.
- Test failover of inference endpoints by rerouting DNS and load balancers to a secondary region during maintenance windows.
- Validate backup integrity of training environments, including container images and dependency configurations.
- Document manual intervention steps for resuming long-running training jobs after site-level outages.
- Implement geo-distributed data lakes to ensure dataset availability during regional network disruptions.
- Conduct quarterly DR drills that simulate power, network, and storage failures in AI clusters.
- Define capacity rollback procedures when failed model deployments trigger reversion to prior stable versions.
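Two of the bullets above reduce to simple checks: whether replication lag has breached the RPO target, and how many standby GPUs the DR site needs for the workloads ranked above the failover cutoff. A minimal sketch with hypothetical priority conventions (lower number = higher priority):

```python
def rpo_breached(last_replicated_ts, now_ts, rpo_seconds):
    """True if the newest replicated checkpoint is older than the RPO allows."""
    return (now_ts - last_replicated_ts) > rpo_seconds

def standby_gpus(workloads, failover_cutoff):
    """Sum the GPUs of workloads at or above the failover priority cutoff.

    workloads -- list of (priority, gpu_count); priority 0 is highest.
    """
    return sum(gpus for priority, gpus in workloads if priority <= failover_cutoff)

# Example: only priority 0-1 workloads fail over; priority 2 waits for recovery.
fleet = [(0, 16), (1, 8), (2, 32)]
print(standby_gpus(fleet, failover_cutoff=1))
print(rpo_breached(last_replicated_ts=0, now_ts=3700, rpo_seconds=3600))
```

Sizing standby capacity from the priority cutoff rather than the full fleet is what keeps DR sites affordable while still meeting RTO for production inference.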
Module 9: Cost Optimization and Capacity Efficiency
- Right-size GPU instances for inference workloads using profiling data from canary deployments.
- Implement autoscaling groups for inference endpoints based on request rate and latency thresholds.
- Consolidate low-utilization training jobs onto shared clusters using container isolation and scheduling priorities.
- Negotiate multi-year hardware leases with vendors to reduce TCO under predictable capacity growth.
- Decommission idle or underutilized nodes and reallocate components to higher-priority projects.
- Use spot or preemptible instances for non-critical training jobs with checkpointing enabled every 15 minutes.
- Optimize batch sizes and gradient accumulation steps to maximize GPU utilization without OOM errors.
- Conduct quarterly cost-per-model analyses to identify inefficiencies in training or serving pipelines.
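The autoscaling bullet's core decision is the replica count for a given request rate. A minimal sketch: divide arrival rate by per-replica capacity, derated by a target utilization so latency headroom survives bursts. The capacity and utilization figures are assumptions:

```python
import math

def replicas_needed(requests_per_s, per_replica_rps, target_util=0.7,
                    min_replicas=1):
    """Replicas required to serve a request rate with latency headroom.

    per_replica_rps -- measured sustainable throughput of one replica
    target_util     -- keep each replica below this fraction of capacity
                       (0.7 is a placeholder; tune from latency SLOs)
    """
    effective_capacity = per_replica_rps * target_util
    return max(min_replicas, math.ceil(requests_per_s / effective_capacity))

# Example: 1000 req/s against replicas that sustain 100 req/s each.
print(replicas_needed(1000.0, 100.0))
```

Running the same formula against profiling data from canary deployments is one concrete way to right-size inference fleets before committing reserved capacity.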