This curriculum covers the technical breadth of a multi-workshop program on production-grade big data systems, at the depth of implementation detail typical of enterprise advisory engagements in data architecture, real-time processing, and ML operations.
Module 1: Data Ingestion Architecture at Scale
- Designing schema-on-read pipelines for heterogeneous data sources including IoT streams, transaction logs, and clickstreams.
- Selecting between batch and micro-batch ingestion based on SLA requirements and downstream processing constraints.
- Implementing idempotent data ingestion to handle duplicate messages in distributed messaging systems like Kafka.
- Configuring retry logic and dead-letter queues for failed records in streaming pipelines without disrupting throughput.
- Integrating change data capture (CDC) from OLTP databases while minimizing replication lag and source system load.
- Managing schema evolution in Parquet or Avro formats across ingestion layers to maintain backward compatibility.
- Securing data in transit using mutual TLS and encrypting sensitive payloads before ingestion.
- Monitoring ingestion pipeline backpressure and tuning consumer group concurrency in real time.
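The idempotency and dead-letter-queue patterns above can be sketched in a few lines. This is a minimal illustration, not a specific Kafka client API: `ingest`, `transform`, and the dict-shaped messages are all hypothetical names, and a real pipeline would persist `seen_ids` (e.g. in a compacted topic or key-value store) rather than hold it in memory.

```python
def transform(msg):
    """Illustrative transformation; rejects records with an empty payload."""
    if msg["payload"] is None:
        raise ValueError("empty payload")
    return {"id": msg["id"], "value": msg["payload"].upper()}

def ingest(messages, sink, dlq, seen_ids):
    """Process each message ID at most once; route failed records to the DLQ."""
    for msg in messages:
        msg_id = msg["id"]
        if msg_id in seen_ids:
            continue  # duplicate delivery: skip without side effects
        try:
            sink.append(transform(msg))
            seen_ids.add(msg_id)  # mark as seen only after a successful write
        except ValueError:
            dlq.append(msg)  # preserve the failed record for later replay
```

Marking the ID as seen only after the write succeeds keeps retries safe: a crash between transform and commit simply reprocesses the message.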
Module 2: Distributed Data Storage and Partitioning Strategies
- Choosing between data lakehouse formats (Delta Lake, Iceberg, Hudi) based on ACID compliance and time travel requirements.
- Designing partitioning and bucketing schemes to optimize query performance on petabyte-scale datasets.
- Implementing lifecycle policies for automated tiering from hot to cold storage based on access patterns.
- Configuring replication factors and erasure coding in HDFS clusters for fault tolerance versus storage cost.
- Enforcing column-level encryption for PII fields in storage while maintaining query efficiency.
- Managing file size distribution to prevent small-file problems in object storage systems.
- Integrating metadata catalogs like AWS Glue or Apache Atlas for centralized schema discovery.
- Handling schema drift detection and resolution in unstructured JSON logs stored in cloud storage.
Module 3: Real-Time Stream Processing with State Management
- Designing stateful stream processing jobs in Flink or Spark Structured Streaming with fault-tolerant checkpoints.
- Choosing between keyed and non-keyed state to manage session windows for user behavior analysis.
- Optimizing state backend configuration (RocksDB vs. in-memory) based on state size and access patterns.
- Implementing watermark strategies to balance event-time processing accuracy and latency.
- Handling out-of-order events in financial transaction streams with bounded delay policies.
- Scaling stateful applications across cluster nodes while managing state redistribution overhead.
- Securing state snapshots stored in cloud storage with customer-managed encryption keys.
- Monitoring state growth and setting alerts for unbounded state accumulation in long-running jobs.
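The bounded-delay watermark strategy in this module can be sketched independently of any engine. The class below mirrors the idea behind Flink's bounded-out-of-orderness watermarks (the class and method names here are illustrative, not Flink's API): the watermark trails the maximum observed event time by a fixed delay, and anything older than the watermark is treated as late.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark that trails the max observed event time by a fixed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        """Advance on new events; the watermark never moves backwards."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.watermark()

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def is_late(self, event_time):
        """An event is late when it arrives behind the current watermark."""
        wm = self.watermark()
        return wm is not None and event_time < wm
```

A larger `max_delay` tolerates more disorder (fewer dropped financial transactions) at the cost of higher end-to-end latency, which is exactly the accuracy/latency trade-off named above.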
Module 4: Feature Engineering and Management in Production
- Building feature pipelines that support both batch and real-time serving with consistent transformations.
- Versioning feature sets and tracking lineage from raw data to model input for auditability.
- Implementing feature stores with low-latency online serving for real-time inference use cases.
- Validating feature distributions in production to detect data drift and skew.
- Managing feature access controls and masking sensitive attributes in shared environments.
- Optimizing feature computation cost by caching and pre-aggregating high-frequency calculations.
- Handling missing values and outliers in feature pipelines with domain-specific imputation logic.
- Integrating feature freshness SLAs with downstream model retraining schedules.
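Validating feature distributions for drift is often done with the Population Stability Index (PSI) over binned distributions. A minimal sketch, assuming the caller has already binned both windows into per-bin proportions; the `eps` floor guarding empty bins is an implementation choice, not a standard:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` is the baseline window, `actual` the production window,
    both as per-bin proportions summing to ~1.
    """
    assert len(expected) == len(actual), "bin counts must match"
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard empty bins against log(0)
        total += (a - e) * math.log(a / e)
    return total
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as major shift worth an alert, though thresholds should be calibrated per feature.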
Module 5: Scalable Model Training and Hyperparameter Optimization
- Distributing model training across GPU clusters using Horovod or native PyTorch DDP.
- Designing data-parallel versus model-parallel strategies for large embedding models.
- Implementing early stopping and checkpointing in distributed hyperparameter tuning jobs.
- Managing resource contention in shared training clusters using Kubernetes quotas and priorities.
- Optimizing data loading pipelines with prefetching and parallel I/O to avoid GPU underutilization.
- Tracking experiments with metadata including code version, hyperparameters, and dataset checksums.
- Securing access to training data containing regulated information using role-based access controls.
- Reducing training costs by scheduling jobs during off-peak cloud pricing windows.
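The early-stopping logic referenced above is engine-agnostic and small enough to sketch directly. This is a patience-based monitor on a validation loss, the pattern shared by most tuning frameworks; the class name and `min_delta` parameter are illustrative:

```python
class EarlyStopping:
    """Stop training after `patience` epochs without meaningful improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # improvement smaller than this is noise
        self.best = None
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if self.best is None or val_loss < self.best - self.min_delta:
            self.best = val_loss  # new best: a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In distributed tuning, the same counter typically lives in the trial driver, and the "new best" branch is where a checkpoint would be written so the winning trial can be restored.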
Module 6: Model Deployment and Serving Infrastructure
- Choosing between REST, gRPC, or message queue interfaces for model serving based on latency and throughput.
- Implementing canary rollouts for model versions with traffic shadowing and A/B testing.
- Configuring autoscaling policies for inference endpoints based on request rate and GPU utilization.
- Managing cold start delays in serverless inference platforms by maintaining warm instances.
- Integrating model explainability outputs into real-time API responses for compliance use cases.
- Enforcing model signing and integrity checks before deployment to prevent tampering.
- Handling model version rollback procedures when performance degrades in production.
- Optimizing model serialization formats (ONNX, TorchScript) for cross-platform inference.
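Canary rollouts need a traffic split that is both weighted and sticky, so a given user consistently hits the same model version during the experiment. A minimal sketch using deterministic hashing; the `route` function and its labels are illustrative, and a real serving mesh would apply the same idea at the gateway:

```python
import hashlib

def route(request_id, canary_fraction=0.05):
    """Deterministically route a request to 'canary' or 'stable'.

    Hashing the request (or user) ID into [0, 1) gives a stable,
    uniformly distributed bucket, so roughly `canary_fraction` of
    traffic hits the canary and each ID always routes the same way.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Rolling forward is then a config change to `canary_fraction`; rolling back sets it to zero without redeploying either model version.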
Module 7: Monitoring, Observability, and Drift Detection
- Instrumenting model inference pipelines with structured logging and distributed tracing.
- Setting up statistical process control charts for model prediction drift over time.
- Correlating feature drift with model performance degradation using historical baselines.
- Implementing shadow mode inference to compare new models against production without affecting users.
- Configuring alert thresholds for abnormal prediction latency or error rate spikes.
- Aggregating and storing inference payloads for retrospective analysis under data retention policies.
- Integrating monitoring dashboards with incident response workflows in PagerDuty or Opsgenie.
- Managing sampling rates for logging predictions to balance observability and storage cost.
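The statistical process control approach above reduces to computing control limits from a historical baseline and flagging points that breach them. A minimal sketch with mean ± k·σ limits (k = 3 is the conventional Shewhart choice); the function names are illustrative:

```python
def control_limits(baseline, k=3.0):
    """Mean ± k standard deviations from a historical baseline window."""
    n = len(baseline)
    mean = sum(baseline) / n
    std = (sum((x - mean) ** 2 for x in baseline) / n) ** 0.5
    return mean - k * std, mean + k * std

def out_of_control(values, lower, upper):
    """Indices of monitored points breaching the control limits."""
    return [i for i, v in enumerate(values) if not lower <= v <= upper]
```

In practice the monitored series would be a windowed statistic of predictions (e.g. hourly mean score), and an out-of-control index would feed the alerting path described in this module.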
Module 8: Data and Model Governance
- Implementing data lineage tracking from source to model output using automated metadata collection.
- Enforcing model approval workflows with audit trails for regulated industries.
- Classifying data sensitivity levels and applying masking or tokenization in training environments.
- Managing model inventory with ownership, deprecation schedules, and usage metrics.
- Conducting periodic model risk assessments for fairness, bias, and regulatory compliance.
- Integrating data governance tools like Great Expectations for validation in production pipelines.
- Documenting model assumptions, limitations, and known failure modes in machine-readable formats.
- Coordinating cross-functional reviews between legal, security, and engineering teams before deployment.
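Sensitivity classification and tokenization can be sketched together: a classification map drives which fields are tokenized before a record enters a training environment. Everything here is illustrative: the `SENSITIVITY` map, the fixed salt (a real system would use a managed secret), and the 12-character token length.

```python
import hashlib

# Illustrative classification of fields by sensitivity level.
SENSITIVITY = {"email": "pii", "ssn": "pii", "age": "internal", "country": "public"}

def tokenize(value, salt="train-env"):
    """Deterministic token: preserves joinability without exposing the value."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]

def mask_record(record):
    """Tokenize PII fields; pass non-PII fields through unchanged."""
    return {
        k: tokenize(v) if SENSITIVITY.get(k) == "pii" else v
        for k, v in record.items()
    }
```

Because tokenization is deterministic, joins and group-bys on the masked column still work, which is why it is often preferred over random masking in shared training environments.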
Module 9: Cost Optimization and Resource Management
- Right-sizing compute clusters for data processing jobs using historical utilization metrics.
- Implementing spot instance strategies for fault-tolerant batch workloads with checkpointing.
- Optimizing data serialization and compression to reduce network egress costs.
- Managing storage costs by enforcing data retention and archival policies automatically.
- Consolidating model inference workloads using multi-model serving endpoints.
- Tracking cost attribution by team, project, or model using cloud tagging and cost allocation keys.
- Designing data compaction jobs to reduce query costs in cloud data warehouses.
- Automating shutdown of non-production environments during off-hours to reduce spend.
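Cost attribution by tag reduces to rolling up billing line items by a tag key, with untagged spend surfaced explicitly rather than silently dropped. A minimal sketch over generic line-item dicts; the field names (`cost`, `tags`) are illustrative stand-ins for a cloud billing export schema:

```python
from collections import defaultdict

def attribute_costs(line_items, key="team"):
    """Roll up billing line items by a tag key.

    Items missing the tag land in an explicit 'untagged' bucket so the
    gap in tagging hygiene is visible and can be chased down.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)
```

The same roll-up run with `key="project"` or `key="model"` gives the other attribution views named above, provided the tagging policy enforces those keys at resource creation.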