This curriculum spans the technical, operational, and organizational practices found in mature AI platform teams. It is comparable to a multi-workshop program that integrates capacity planning, incident response, and cross-functional governance across infrastructure, machine learning, and product units.
Module 1: Defining and Measuring AI System Capacity
- Selecting appropriate throughput metrics (e.g., queries per second, tokens processed per hour) based on model type and deployment environment
- Establishing baseline performance under controlled load using synthetic workloads that mirror production data patterns
- Instrumenting inference pipelines with granular latency and queue depth monitoring at each processing stage
- Identifying hardware bottlenecks (GPU VRAM, CPU memory bandwidth, interconnect saturation) through profiling tools like NVIDIA Nsight or PyTorch Profiler
- Quantifying the impact of variable input length on batch processing efficiency in transformer-based models
- Calibrating capacity thresholds that trigger scaling actions while avoiding thrashing due to transient spikes
- Mapping user SLAs (e.g., p95 latency < 500ms) to infrastructure provisioning requirements
- Designing capacity tests that account for cold-start effects in serverless inference environments
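The SLA-mapping bullet above can be sketched numerically. This is a minimal illustration, not a production tool: the nearest-rank percentile and the 500 ms target echo the example SLO in this module, and the function names are assumptions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    # Floor division on a negated value acts as a ceiling, giving the
    # nearest-rank index: ceil(pct/100 * n) - 1, clamped at zero.
    rank = max(0, -(-len(ordered) * pct // 100) - 1)
    return ordered[rank]

def sla_headroom(latencies_ms, slo_p95_ms=500.0):
    """Observed p95 and the remaining headroom against the p95 SLO.

    Negative headroom means the SLO is breached and provisioning
    (or optimization) needs to change before launch.
    """
    p95 = percentile(latencies_ms, 95)
    return p95, slo_p95_ms - p95
```

In practice the latency samples would come from a load test run against a synthetic workload that mirrors production traffic, as described in the baseline-performance bullet.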
Module 2: Infrastructure Provisioning and Resource Allocation
- Choosing between dedicated instances, spot instances, and reserved capacity for cost-performance trade-offs in cloud deployments
- Right-sizing GPU instances based on model memory footprint and computational intensity benchmarks
- Partitioning shared cluster resources across multiple AI workloads using Kubernetes namespaces and resource quotas
- Implementing node affinity and taints to ensure latency-sensitive models run on high-performance hardware
- Configuring autoscaling groups with predictive and reactive triggers based on queue backlog and GPU utilization
- Managing burst capacity for batch inference jobs without disrupting real-time serving workloads
- Allocating memory overhead for model loading, caching, and framework operations beyond raw model size
- Designing multi-region failover strategies that preserve capacity during regional outages
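The right-sizing and memory-overhead bullets above reduce to simple arithmetic. The sketch below is a back-of-the-envelope estimator under stated assumptions: weights at a fixed bytes-per-parameter, plus a flat 20% overhead fraction for framework buffers and fragmentation; the names and default values are illustrative, not vendor figures.

```python
def estimate_serving_memory_gb(param_count, bytes_per_param=2,
                               kv_cache_gb=0.0, overhead_fraction=0.2):
    """Rough GPU memory needed to serve a model: weights (e.g., 2 bytes
    per parameter for FP16) plus KV cache, inflated by a fractional
    overhead for framework buffers, activations, and fragmentation."""
    weights_gb = param_count * bytes_per_param / 1e9
    return (weights_gb + kv_cache_gb) * (1 + overhead_fraction)

def fits_on_gpu(param_count, gpu_vram_gb, **kwargs):
    """True if the estimated serving footprint fits in a single GPU's VRAM."""
    return estimate_serving_memory_gb(param_count, **kwargs) <= gpu_vram_gb
```

A 7B-parameter model in FP16 lands around 16.8 GB under these assumptions, which is why it fits a 24 GB card but not a 16 GB one once overhead is counted.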
Module 3: Model Optimization for Capacity Efficiency
- Applying reduced-precision and quantized formats (e.g., FP16, INT8) and evaluating accuracy degradation against latency gains
- Implementing dynamic batching with adaptive batch size tuning based on incoming request patterns
- Using model pruning to reduce parameter count while maintaining inference quality within acceptable bounds
- Deploying distillation techniques to replace large teacher models with faster, smaller student models
- Integrating speculative decoding to accelerate autoregressive generation without compromising output quality
- Selecting appropriate attention mechanisms (e.g., FlashAttention) to reduce memory bandwidth constraints
- Optimizing model checkpoints for fast loading and reduced initialization time during scaling events
- Profiling kernel execution times to identify and eliminate inefficient operations in computational graphs
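The dynamic-batching bullet above can be sketched in a few lines. This is a simplified model of the idea, assuming batch size adapts only to queue depth; real servers also weigh per-request deadlines and padding cost, and all names here are illustrative.

```python
def adaptive_batch_size(queue_depth, max_batch=32, min_batch=1):
    """Grow the batch with the backlog, capped at the hardware limit:
    a deep queue favors throughput, a shallow one favors latency."""
    return max(min_batch, min(max_batch, queue_depth))

def drain(queue, max_batch=32):
    """Pop one adaptively sized batch off the pending request queue,
    returning the batch and the remaining queue."""
    size = adaptive_batch_size(len(queue), max_batch)
    return queue[:size], queue[size:]
```

With variable-length transformer inputs, a production batcher would additionally bucket requests by sequence length so that padding does not erase the throughput gain, per the variable-input-length bullet in Module 1.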
Module 4: Workload Prioritization and Throttling Strategies
- Implementing request queuing with priority levels for high-value customers or critical internal systems
- Designing rate-limiting policies that differentiate between API consumers based on contractual tiers
- Enforcing fair-share scheduling across departments sharing a centralized AI platform
- Configuring circuit breakers to halt low-priority workloads during capacity emergencies
- Routing overflow traffic to lower-fidelity models when primary endpoints are saturated
- Logging and auditing throttled requests for post-incident analysis and capacity planning
- Developing SLA-based penalty calculations for internal chargeback models during overutilization
- Implementing graceful degradation by reducing response fidelity (e.g., shorter generations) under load
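The tiered rate-limiting bullets above are commonly built on a token bucket per contractual tier. The sketch below is a minimal single-process version; the tier names, rates, and burst sizes are made-up examples, and a real deployment would keep bucket state in a shared store.

```python
import time

class TokenBucket:
    """Token-bucket limiter: tokens refill at a steady rate up to a
    burst capacity; each request spends one token or is throttled."""

    def __init__(self, rate_per_s, burst, now=None):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill for the elapsed interval, then try to spend one token.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Illustrative contractual tiers (values are assumptions, not real quotas).
TIERS = {"enterprise": TokenBucket(100, 200), "free": TokenBucket(1, 5)}
```

Throttled requests (the `False` branch) would be logged for the post-incident analysis called out above rather than silently dropped.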
Module 5: Scaling Patterns and Deployment Topologies
- Choosing between vertical scaling (larger instances) and horizontal scaling (more replicas) based on model memory constraints
- Designing canary deployments that validate capacity assumptions before full rollout
- Implementing model parallelism for large models that exceed single-device memory capacity
- Configuring rolling updates with surge capacity to maintain availability during version transitions
- Deploying edge inference nodes to reduce central cluster load for geographically distributed users
- Integrating model mesh architectures to enable shared compute pools across multiple services
- Using preemptible nodes for batch workloads with restart tolerance to reduce operational costs
- Validating scaling policies under mixed workload conditions to prevent resource starvation
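The horizontal-scaling choice above ultimately produces a replica count. A minimal sizing sketch, assuming a measured per-replica throughput and a headroom factor for spikes and rolling updates (both parameter names and defaults are illustrative):

```python
import math

def required_replicas(target_qps, per_replica_qps,
                      headroom=0.3, min_replicas=2):
    """Replicas needed to serve target_qps with spare headroom for
    traffic spikes and surge capacity during rolling updates.
    A floor of min_replicas preserves redundancy for failover."""
    needed = math.ceil(target_qps * (1 + headroom) / per_replica_qps)
    return max(min_replicas, needed)
```

The same arithmetic underpins the surge-capacity bullet: a rolling update that takes one replica out of rotation only stays safe because the headroom term over-provisions the steady state.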
Module 6: Monitoring, Alerting, and Capacity Forecasting
- Establishing baseline capacity utilization trends by time of day, day of week, and business cycle
- Setting dynamic alert thresholds using statistical process control instead of static limits
- Correlating model performance degradation with infrastructure metrics to isolate root causes
- Forecasting capacity needs using time series models trained on historical usage and business KPIs
- Integrating business event calendars (e.g., product launches) into predictive scaling models
- Creating cross-stack dashboards that unify application, infrastructure, and business metrics
- Automating capacity reviews with anomaly detection on forecasting residuals
- Tracking model efficiency decay over time as input distributions drift from training data
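The statistical-process-control bullet above can be illustrated with classic control limits. This sketch assumes a simple mean ± k·sigma rule over a recent window; production systems often add seasonality-aware baselines per the time-of-day bullet.

```python
import statistics

def spc_limits(samples, k=3.0):
    """Statistical-process-control limits for a metric: mean +/- k
    standard deviations of a recent window, replacing a static threshold."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, samples, k=3.0):
    """Alert only when the new observation escapes the control band."""
    lo, hi = spc_limits(samples, k)
    return not (lo <= value <= hi)
```

Because the band widens and narrows with the metric's own recent variance, a noisy GPU-utilization series does not page on-call for fluctuations that a static limit would flag.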
Module 7: Cost Governance and Financial Controls
- Implementing tagging and labeling strategies to attribute AI compute costs to business units
- Setting budget alerts with automated enforcement actions (e.g., deployment freezes) at threshold breaches
- Conducting cost-per-inference analysis across model variants to inform optimization priorities
- Enforcing model retirement policies for underutilized endpoints consuming idle capacity
- Negotiating committed use discounts based on forecasted minimum capacity requirements
- Auditing model version sprawl and consolidating redundant deployments
- Implementing approval workflows for high-cost operations (e.g., large-scale fine-tuning jobs)
- Comparing the total cost of ownership (TCO) of on-prem vs. cloud for long-running inference workloads
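The cost-per-inference bullet above is a small amortization. The sketch below uses invented example numbers (the $36/hour instance price is illustrative, not a real cloud rate):

```python
def cost_per_1k_inferences(hourly_instance_cost, avg_qps):
    """Amortize an instance's hourly price over the requests it actually
    serves in that hour, reported per 1,000 inferences."""
    served_per_hour = avg_qps * 3600
    return hourly_instance_cost / served_per_hour * 1000
```

Running this per model variant makes optimization priorities concrete: halving latency on a variant that already costs a fraction of a cent per thousand calls matters less than retiring an idle endpoint burning the same hourly rate at near-zero QPS.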
Module 8: Incident Response and Capacity Recovery
- Executing predefined runbooks for capacity exhaustion scenarios with clear role assignments
- Initiating emergency scaling procedures while maintaining system stability under duress
- Rolling back recent deployments that caused unexpected capacity spikes
- Engaging model owners to optimize inefficient inference patterns during outages
- Documenting post-mortems that link capacity incidents to specific architectural or operational decisions
- Updating forecasting models with incident data to improve future predictions
- Validating recovery by measuring stabilization of key metrics (latency, error rate, queue depth)
- Rebalancing workloads across clusters to restore redundancy after failover events
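The recovery-validation bullet above needs a concrete definition of "stabilized." One simple rule, sketched here under assumptions of our own choosing (a flat recent window plus an absolute ceiling; names and thresholds are illustrative):

```python
def is_stabilized(series, window=5, max_slope=0.01, ceiling=None):
    """Declare a metric recovered when its last `window` samples are
    flat (average step-to-step change is a small fraction of the level)
    and, optionally, sit below an absolute ceiling."""
    recent = series[-window:]
    if len(recent) < window:
        return False
    steps = [abs(b - a) for a, b in zip(recent, recent[1:])]
    mean_level = sum(recent) / len(recent)
    flat = (sum(steps) / len(steps)) <= max_slope * max(mean_level, 1e-9)
    under = ceiling is None or max(recent) <= ceiling
    return flat and under
```

Applying the same check to latency, error rate, and queue depth (the three metrics named above) before closing an incident avoids declaring victory while a backlog is still draining.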
Module 9: Cross-Functional Capacity Governance
- Establishing capacity review boards with representation from infrastructure, ML, and product teams
- Defining capacity SLIs and SLOs that align technical performance with business outcomes
- Requiring capacity impact assessments for all new model deployments
- Creating standardized capacity testing protocols for vendor and third-party models
- Enforcing model registration requirements that include efficiency benchmarks
- Coordinating capacity planning cycles with fiscal and product roadmaps
- Developing escalation paths for capacity conflicts between business units
- Auditing compliance with data retention policies that impact storage and processing capacity
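The SLI/SLO bullet above is often operationalized as an error budget, which gives governance boards a shared number to argue over. A minimal sketch for an availability SLO (the figures in the usage are examples, not targets from the source):

```python
def error_budget_remaining(slo_target, window_requests, failed_requests):
    """Fraction of the error budget still unspent in the current window.

    An SLO of 0.999 over 1M requests allows ~1,000 failures; spending
    250 of them leaves 75% of the budget. At 0.0 remaining, the SLO
    is breached and new-deployment freezes typically kick in.
    """
    allowed = (1 - slo_target) * window_requests
    if allowed <= 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed)
```

Tying capacity impact assessments to the remaining budget gives the review board an objective gate: a deployment that risks the last of the budget escalates, one with ample budget proceeds.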