This curriculum spans the technical, operational, and organizational practices found in mature AI platform teams. It is comparable to a multi-workshop program that integrates capacity planning, incident response, and cross-functional governance across infrastructure, machine learning, and product units.
Module 1: Defining and Measuring AI System Capacity
- Selecting appropriate throughput metrics (e.g., queries per second, tokens processed per hour) based on model type and deployment environment
- Establishing baseline performance under controlled load using synthetic workloads that mirror production data patterns
- Instrumenting inference pipelines with granular latency and queue depth monitoring at each processing stage
- Identifying hardware bottlenecks (GPU VRAM, CPU memory bandwidth, interconnect saturation) through profiling tools like NVIDIA Nsight or PyTorch Profiler
- Quantifying the impact of variable input length on batch processing efficiency in transformer-based models
- Calibrating capacity thresholds that trigger scaling actions while avoiding thrashing due to transient spikes
- Mapping user SLAs (e.g., p95 latency < 500ms) to infrastructure provisioning requirements
- Designing capacity tests that account for cold-start effects in serverless inference environments
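The SLA-mapping bullet above can be sketched numerically. This is a minimal illustration, not a production tool: the nearest-rank percentile and the 500 ms target echo the example SLO in this module, and the function names are assumptions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    # Floor division on a negated value acts as a ceiling, giving the
    # nearest-rank index: ceil(pct/100 * n) - 1, clamped at zero.
    rank = max(0, -(-len(ordered) * pct // 100) - 1)
    return ordered[rank]

def sla_headroom(latencies_ms, slo_p95_ms=500.0):
    """Observed p95 and the remaining headroom against the p95 SLO.

    Negative headroom means the SLO is breached and provisioning
    (or optimization) needs to change before launch.
    """
    p95 = percentile(latencies_ms, 95)
    return p95, slo_p95_ms - p95
```

In practice the latency samples would come from a load test run against a synthetic workload that mirrors production traffic, as described in the baseline-performance bullet.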
Module 2: Infrastructure Provisioning and Resource Allocation
- Choosing between dedicated instances, spot instances, and reserved capacity for cost-performance trade-offs in cloud deployments
- Right-sizing GPU instances based on model memory footprint and computational intensity benchmarks
- Partitioning shared cluster resources across multiple AI workloads using Kubernetes namespaces and resource quotas
- Implementing node affinity and taints to ensure latency-sensitive models run on high-performance hardware
- Configuring autoscaling groups with predictive and reactive triggers based on queue backlog and GPU utilization
- Managing burst capacity for batch inference jobs without disrupting real-time serving workloads
- Allocating memory overhead for model loading, caching, and framework operations beyond raw model size
- Designing multi-region failover strategies that preserve capacity during regional outages
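The right-sizing and memory-overhead bullets above reduce to simple arithmetic. The sketch below is a back-of-the-envelope estimator under stated assumptions: weights at a fixed bytes-per-parameter, plus a flat 20% overhead fraction for framework buffers and fragmentation; the names and default values are illustrative, not vendor figures.

```python
def estimate_serving_memory_gb(param_count, bytes_per_param=2,
                               kv_cache_gb=0.0, overhead_fraction=0.2):
    """Rough GPU memory needed to serve a model: weights (e.g., 2 bytes
    per parameter for FP16) plus KV cache, inflated by a fractional
    overhead for framework buffers, activations, and fragmentation."""
    weights_gb = param_count * bytes_per_param / 1e9
    return (weights_gb + kv_cache_gb) * (1 + overhead_fraction)

def fits_on_gpu(param_count, gpu_vram_gb, **kwargs):
    """True if the estimated serving footprint fits in a single GPU's VRAM."""
    return estimate_serving_memory_gb(param_count, **kwargs) <= gpu_vram_gb
```

A 7B-parameter model in FP16 lands around 16.8 GB under these assumptions, which is why it fits a 24 GB card but not a 16 GB one once overhead is counted.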
Module 3: Model Optimization for Capacity Efficiency
- Applying reduced-precision and quantized formats (e.g., FP16, INT8) and evaluating accuracy degradation against latency gains
- Implementing dynamic batching with adaptive batch size tuning based on incoming request patterns
- Using model pruning to reduce parameter count while maintaining inference quality within acceptable bounds
- Deploying distillation techniques to replace large teacher models with faster, smaller student models
- Integrating speculative decoding to accelerate autoregressive generation without compromising output quality
- Selecting appropriate attention mechanisms (e.g., FlashAttention) to reduce memory bandwidth constraints
- Optimizing model checkpoints for fast loading and reduced initialization time during scaling events
- Profiling kernel execution times to identify and eliminate inefficient operations in computational graphs
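The dynamic-batching bullet above can be sketched in a few lines. This is a simplified model of the idea, assuming batch size adapts only to queue depth; real servers also weigh per-request deadlines and padding cost, and all names here are illustrative.

```python
def adaptive_batch_size(queue_depth, max_batch=32, min_batch=1):
    """Grow the batch with the backlog, capped at the hardware limit:
    a deep queue favors throughput, a shallow one favors latency."""
    return max(min_batch, min(max_batch, queue_depth))

def drain(queue, max_batch=32):
    """Pop one adaptively sized batch off the pending request queue,
    returning the batch and the remaining queue."""
    size = adaptive_batch_size(len(queue), max_batch)
    return queue[:size], queue[size:]
```

With variable-length transformer inputs, a production batcher would additionally bucket requests by sequence length so that padding does not erase the throughput gain, per the variable-input-length bullet in Module 1.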
Module 4: Workload Prioritization and Throttling Strategies
- Implementing request queuing with priority levels for high-value customers or critical internal systems
- Designing rate-limiting policies that differentiate between API consumers based on contractual tiers
- Enforcing fair-share scheduling across departments sharing a centralized AI platform
- Configuring circuit breakers to halt low-priority workloads during capacity emergencies
- Routing overflow traffic to lower-fidelity models when primary endpoints are saturated
- Logging and auditing throttled requests for post-incident analysis and capacity planning
- Developing SLA-based penalty calculations for internal chargeback models during overutilization
- Implementing graceful degradation by reducing response fidelity (e.g., shorter generations) under load
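The tiered rate-limiting bullets above are commonly built on a token bucket per contractual tier. The sketch below is a minimal single-process version; the tier names, rates, and burst sizes are made-up examples, and a real deployment would keep bucket state in a shared store.

```python
import time

class TokenBucket:
    """Token-bucket limiter: tokens refill at a steady rate up to a
    burst capacity; each request spends one token or is throttled."""

    def __init__(self, rate_per_s, burst, now=None):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill for the elapsed interval, then try to spend one token.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Illustrative contractual tiers (values are assumptions, not real quotas).
TIERS = {"enterprise": TokenBucket(100, 200), "free": TokenBucket(1, 5)}
```

Throttled requests (the `False` branch) would be logged for the post-incident analysis called out above rather than silently dropped.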
Module 5: Scaling Patterns and Deployment Topologies
- Choosing between vertical scaling (larger instances) and horizontal scaling (more replicas) based on model memory constraints
- Designing canary deployments that validate capacity assumptions before full rollout
- Implementing model parallelism for large models that exceed single-device memory capacity
- Configuring rolling updates with surge capacity to maintain availability during version transitions
- Deploying edge inference nodes to reduce central cluster load for geographically distributed users
- Integrating model mesh architectures to enable shared compute pools across multiple services
- Using preemptible nodes for batch workloads with restart tolerance to reduce operational costs
- Validating scaling policies under mixed workload conditions to prevent resource starvation
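The horizontal-scaling choice above ultimately produces a replica count. A minimal sizing sketch, assuming a measured per-replica throughput and a headroom factor for spikes and rolling updates (both parameter names and defaults are illustrative):

```python
import math

def required_replicas(target_qps, per_replica_qps,
                      headroom=0.3, min_replicas=2):
    """Replicas needed to serve target_qps with spare headroom for
    traffic spikes and surge capacity during rolling updates.
    A floor of min_replicas preserves redundancy for failover."""
    needed = math.ceil(target_qps * (1 + headroom) / per_replica_qps)
    return max(min_replicas, needed)
```

The same arithmetic underpins the surge-capacity bullet: a rolling update that takes one replica out of rotation only stays safe because the headroom term over-provisions the steady state.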
Module 6: Monitoring, Alerting, and Capacity Forecasting
- Establishing baseline capacity utilization trends by time of day, day of week, and business cycle
- Setting dynamic alert thresholds using statistical process control instead of static limits
- Correlating model performance degradation with infrastructure metrics to isolate root causes
- Forecasting capacity needs using time series models trained on historical usage and business KPIs
- Integrating business event calendars (e.g., product launches) into predictive scaling models
- Creating cross-stack dashboards that unify application, infrastructure, and business metrics
- Automating capacity reviews with anomaly detection on forecasting residuals
- Tracking model efficiency decay over time as input distributions drift from training data
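The statistical-process-control bullet above can be illustrated with classic control limits. This sketch assumes a simple mean ± k·sigma rule over a recent window; production systems often add seasonality-aware baselines per the time-of-day bullet.

```python
import statistics

def spc_limits(samples, k=3.0):
    """Statistical-process-control limits for a metric: mean +/- k
    standard deviations of a recent window, replacing a static threshold."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, samples, k=3.0):
    """Alert only when the new observation escapes the control band."""
    lo, hi = spc_limits(samples, k)
    return not (lo <= value <= hi)
```

Because the band widens and narrows with the metric's own recent variance, a noisy GPU-utilization series does not page on-call for fluctuations that a static limit would flag.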
Module 7: Cost Governance and Financial Controls
- Implementing tagging and labeling strategies to attribute AI compute costs to business units
- Setting budget alerts with automated enforcement actions (e.g., deployment freezes) at threshold breaches
- Conducting cost-per-inference analysis across model variants to inform optimization priorities
- Enforcing model retirement policies for underutilized endpoints consuming idle capacity
- Negotiating committed use discounts based on forecasted minimum capacity requirements
- Auditing model version sprawl and consolidating redundant deployments
- Implementing approval workflows for high-cost operations (e.g., large-scale fine-tuning jobs)
- Comparing the total cost of ownership (TCO) of on-prem vs. cloud for long-running inference workloads
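The cost-per-inference bullet above is a small amortization. The sketch below uses invented example numbers (the $36/hour instance price is illustrative, not a real cloud rate):

```python
def cost_per_1k_inferences(hourly_instance_cost, avg_qps):
    """Amortize an instance's hourly price over the requests it actually
    serves in that hour, reported per 1,000 inferences."""
    served_per_hour = avg_qps * 3600
    return hourly_instance_cost / served_per_hour * 1000
```

Running this per model variant makes optimization priorities concrete: halving latency on a variant that already costs a fraction of a cent per thousand calls matters less than retiring an idle endpoint burning the same hourly rate at near-zero QPS.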
Module 8: Incident Response and Capacity Recovery
- Executing predefined runbooks for capacity exhaustion scenarios with clear role assignments
- Initiating emergency scaling procedures while maintaining system stability under duress
- Rolling back recent deployments that caused unexpected capacity spikes
- Engaging model owners to optimize inefficient inference patterns during outages
- Documenting post-mortems that link capacity incidents to specific architectural or operational decisions
- Updating forecasting models with incident data to improve future predictions
- Validating recovery by measuring stabilization of key metrics (latency, error rate, queue depth)
- Rebalancing workloads across clusters to restore redundancy after failover events
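The recovery-validation bullet above needs a concrete definition of "stabilized." One simple rule, sketched here under assumptions of our own choosing (a flat recent window plus an absolute ceiling; names and thresholds are illustrative):

```python
def is_stabilized(series, window=5, max_slope=0.01, ceiling=None):
    """Declare a metric recovered when its last `window` samples are
    flat (average step-to-step change is a small fraction of the level)
    and, optionally, sit below an absolute ceiling."""
    recent = series[-window:]
    if len(recent) < window:
        return False
    steps = [abs(b - a) for a, b in zip(recent, recent[1:])]
    mean_level = sum(recent) / len(recent)
    flat = (sum(steps) / len(steps)) <= max_slope * max(mean_level, 1e-9)
    under = ceiling is None or max(recent) <= ceiling
    return flat and under
```

Applying the same check to latency, error rate, and queue depth (the three metrics named above) before closing an incident avoids declaring victory while a backlog is still draining.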
Module 9: Cross-Functional Capacity Governance
- Establishing capacity review boards with representation from infrastructure, ML, and product teams
- Defining capacity SLIs and SLOs that align technical performance with business outcomes
- Requiring capacity impact assessments for all new model deployments
- Creating standardized capacity testing protocols for vendor and third-party models
- Enforcing model registration requirements that include efficiency benchmarks
- Coordinating capacity planning cycles with fiscal and product roadmaps
- Developing escalation paths for capacity conflicts between business units
- Auditing compliance with data retention policies that impact storage and processing capacity
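The SLI/SLO bullet above is often operationalized as an error budget, which gives governance boards a shared number to argue over. A minimal sketch for an availability SLO (the figures in the usage are examples, not targets from the source):

```python
def error_budget_remaining(slo_target, window_requests, failed_requests):
    """Fraction of the error budget still unspent in the current window.

    An SLO of 0.999 over 1M requests allows ~1,000 failures; spending
    250 of them leaves 75% of the budget. At 0.0 remaining, the SLO
    is breached and new-deployment freezes typically kick in.
    """
    allowed = (1 - slo_target) * window_requests
    if allowed <= 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed)
```

Tying capacity impact assessments to the remaining budget gives the review board an objective gate: a deployment that risks the last of the budget escalates, one with ample budget proceeds.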