This curriculum covers the technical breadth of a multi-workshop program on production-grade big data systems, at the depth of implementation detail typical of enterprise advisory engagements in data architecture, real-time processing, and ML operations.
Module 1: Data Ingestion Architecture at Scale
- Designing schema-on-read pipelines for heterogeneous data sources including IoT streams, transaction logs, and clickstreams.
- Selecting between batch and micro-batch ingestion based on SLA requirements and downstream processing constraints.
- Implementing idempotent data ingestion to handle duplicate messages in distributed messaging systems like Kafka.
- Configuring retry logic and dead-letter queues for failed records in streaming pipelines without disrupting throughput.
- Integrating change data capture (CDC) from OLTP databases while minimizing replication lag and source system load.
- Managing schema evolution in Parquet or Avro formats across ingestion layers to maintain backward compatibility.
- Securing data in transit using mutual TLS and encrypting sensitive payloads before ingestion.
- Monitoring ingestion pipeline backpressure and tuning consumer group concurrency in real time.
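The idempotency and dead-letter-queue patterns above can be sketched in a few lines. This is a minimal illustration, not a specific Kafka client API: `ingest`, `transform`, and the dict-shaped messages are all hypothetical names, and a real pipeline would persist `seen_ids` (e.g. in a compacted topic or key-value store) rather than hold it in memory.

```python
def transform(msg):
    """Illustrative transformation; rejects records with an empty payload."""
    if msg["payload"] is None:
        raise ValueError("empty payload")
    return {"id": msg["id"], "value": msg["payload"].upper()}

def ingest(messages, sink, dlq, seen_ids):
    """Process each message ID at most once; route failed records to the DLQ."""
    for msg in messages:
        msg_id = msg["id"]
        if msg_id in seen_ids:
            continue  # duplicate delivery: skip without side effects
        try:
            sink.append(transform(msg))
            seen_ids.add(msg_id)  # mark as seen only after a successful write
        except ValueError:
            dlq.append(msg)  # preserve the failed record for later replay
```

Marking the ID as seen only after the write succeeds keeps retries safe: a crash between transform and commit simply reprocesses the message.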
Module 2: Distributed Data Storage and Partitioning Strategies
- Choosing between data lakehouse formats (Delta Lake, Iceberg, Hudi) based on ACID compliance and time travel requirements.
- Designing partitioning and bucketing schemes to optimize query performance on petabyte-scale datasets.
- Implementing lifecycle policies for automated tiering from hot to cold storage based on access patterns.
- Configuring replication factors and erasure coding in HDFS clusters for fault tolerance versus storage cost.
- Enforcing column-level encryption for PII fields in storage while maintaining query efficiency.
- Managing file size distribution to prevent small-file problems in object storage systems.
- Integrating metadata catalogs like AWS Glue or Apache Atlas for centralized schema discovery.
- Handling schema drift detection and resolution in unstructured JSON logs stored in cloud storage.
Module 3: Real-Time Stream Processing with State Management
- Designing stateful stream processing jobs in Flink or Spark Structured Streaming with fault-tolerant checkpoints.
- Choosing between keyed and non-keyed state to manage session windows for user behavior analysis.
- Optimizing state backend configuration (RocksDB vs. in-memory) based on state size and access patterns.
- Implementing watermark strategies to balance event-time processing accuracy and latency.
- Handling out-of-order events in financial transaction streams with bounded delay policies.
- Scaling stateful applications across cluster nodes while managing state redistribution overhead.
- Securing state snapshots stored in cloud storage with customer-managed encryption keys.
- Monitoring state growth and setting alerts for unbounded state accumulation in long-running jobs.
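The bounded-delay watermark strategy in this module can be sketched independently of any engine. The class below mirrors the idea behind Flink's bounded-out-of-orderness watermarks (the class and method names here are illustrative, not Flink's API): the watermark trails the maximum observed event time by a fixed delay, and anything older than the watermark is treated as late.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark that trails the max observed event time by a fixed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        """Advance on new events; the watermark never moves backwards."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.watermark()

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def is_late(self, event_time):
        """An event is late when it arrives behind the current watermark."""
        wm = self.watermark()
        return wm is not None and event_time < wm
```

A larger `max_delay` tolerates more disorder (fewer dropped financial transactions) at the cost of higher end-to-end latency, which is exactly the accuracy/latency trade-off named above.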
Module 4: Feature Engineering and Management in Production
- Building feature pipelines that support both batch and real-time serving with consistent transformations.
- Versioning feature sets and tracking lineage from raw data to model input for auditability.
- Implementing feature stores with low-latency online serving for real-time inference use cases.
- Validating feature distributions in production to detect data drift and skew.
- Managing feature access controls and masking sensitive attributes in shared environments.
- Optimizing feature computation cost by caching and pre-aggregating high-frequency calculations.
- Handling missing values and outliers in feature pipelines with domain-specific imputation logic.
- Integrating feature freshness SLAs with downstream model retraining schedules.
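Validating feature distributions for drift is often done with the Population Stability Index (PSI) over binned distributions. A minimal sketch, assuming the caller has already binned both windows into per-bin proportions; the `eps` floor guarding empty bins is an implementation choice, not a standard:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` is the baseline window, `actual` the production window,
    both as per-bin proportions summing to ~1.
    """
    assert len(expected) == len(actual), "bin counts must match"
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard empty bins against log(0)
        total += (a - e) * math.log(a / e)
    return total
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as major shift worth an alert, though thresholds should be calibrated per feature.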
Module 5: Scalable Model Training and Hyperparameter Optimization
- Distributing model training across GPU clusters using Horovod or native PyTorch DDP.
- Designing data-parallel versus model-parallel strategies for large embedding models.
- Implementing early stopping and checkpointing in distributed hyperparameter tuning jobs.
- Managing resource contention in shared training clusters using Kubernetes quotas and priorities.
- Optimizing data loading pipelines with prefetching and parallel I/O to avoid GPU underutilization.
- Tracking experiments with metadata including code version, hyperparameters, and dataset checksums.
- Securing access to training data containing regulated information using role-based access controls.
- Reducing training costs by scheduling jobs during off-peak cloud pricing windows.
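The early-stopping logic referenced above is engine-agnostic and small enough to sketch directly. This is a patience-based monitor on a validation loss, the pattern shared by most tuning frameworks; the class name and `min_delta` parameter are illustrative:

```python
class EarlyStopping:
    """Stop training after `patience` epochs without meaningful improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # improvement smaller than this is noise
        self.best = None
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if self.best is None or val_loss < self.best - self.min_delta:
            self.best = val_loss  # new best: a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In distributed tuning, the same counter typically lives in the trial driver, and the "new best" branch is where a checkpoint would be written so the winning trial can be restored.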
Module 6: Model Deployment and Serving Infrastructure
- Choosing between REST, gRPC, or message queue interfaces for model serving based on latency and throughput.
- Implementing canary rollouts for model versions with traffic shadowing and A/B testing.
- Configuring autoscaling policies for inference endpoints based on request rate and GPU utilization.
- Managing cold start delays in serverless inference platforms by maintaining warm instances.
- Integrating model explainability outputs into real-time API responses for compliance use cases.
- Enforcing model signing and integrity checks before deployment to prevent tampering.
- Handling model version rollback procedures when performance degrades in production.
- Optimizing model serialization formats (ONNX, TorchScript) for cross-platform inference.
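Canary rollouts need a traffic split that is both weighted and sticky, so a given user consistently hits the same model version during the experiment. A minimal sketch using deterministic hashing; the `route` function and its labels are illustrative, and a real serving mesh would apply the same idea at the gateway:

```python
import hashlib

def route(request_id, canary_fraction=0.05):
    """Deterministically route a request to 'canary' or 'stable'.

    Hashing the request (or user) ID into [0, 1) gives a stable,
    uniformly distributed bucket, so roughly `canary_fraction` of
    traffic hits the canary and each ID always routes the same way.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Rolling forward is then a config change to `canary_fraction`; rolling back sets it to zero without redeploying either model version.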
Module 7: Monitoring, Observability, and Drift Detection
- Instrumenting model inference pipelines with structured logging and distributed tracing.
- Setting up statistical process control charts for model prediction drift over time.
- Correlating feature drift with model performance degradation using historical baselines.
- Implementing shadow mode inference to compare new models against production without affecting users.
- Configuring alert thresholds for abnormal prediction latency or error rate spikes.
- Aggregating and storing inference payloads for retrospective analysis under data retention policies.
- Integrating monitoring dashboards with incident response workflows in PagerDuty or Opsgenie.
- Managing sampling rates for logging predictions to balance observability and storage cost.
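The statistical process control approach above reduces to computing control limits from a historical baseline and flagging points that breach them. A minimal sketch with mean ± k·σ limits (k = 3 is the conventional Shewhart choice); the function names are illustrative:

```python
def control_limits(baseline, k=3.0):
    """Mean ± k standard deviations from a historical baseline window."""
    n = len(baseline)
    mean = sum(baseline) / n
    std = (sum((x - mean) ** 2 for x in baseline) / n) ** 0.5
    return mean - k * std, mean + k * std

def out_of_control(values, lower, upper):
    """Indices of monitored points breaching the control limits."""
    return [i for i, v in enumerate(values) if not lower <= v <= upper]
```

In practice the monitored series would be a windowed statistic of predictions (e.g. hourly mean score), and an out-of-control index would feed the alerting path described in this module.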
Module 8: Data and Model Governance
- Implementing data lineage tracking from source to model output using automated metadata collection.
- Enforcing model approval workflows with audit trails for regulated industries.
- Classifying data sensitivity levels and applying masking or tokenization in training environments.
- Managing model inventory with ownership, deprecation schedules, and usage metrics.
- Conducting periodic model risk assessments for fairness, bias, and regulatory compliance.
- Integrating data governance tools like Great Expectations for validation in production pipelines.
- Documenting model assumptions, limitations, and known failure modes in machine-readable formats.
- Coordinating cross-functional reviews between legal, security, and engineering teams before deployment.
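Sensitivity classification and tokenization can be sketched together: a classification map drives which fields are tokenized before a record enters a training environment. Everything here is illustrative: the `SENSITIVITY` map, the fixed salt (a real system would use a managed secret), and the 12-character token length.

```python
import hashlib

# Illustrative classification of fields by sensitivity level.
SENSITIVITY = {"email": "pii", "ssn": "pii", "age": "internal", "country": "public"}

def tokenize(value, salt="train-env"):
    """Deterministic token: preserves joinability without exposing the value."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]

def mask_record(record):
    """Tokenize PII fields; pass non-PII fields through unchanged."""
    return {
        k: tokenize(v) if SENSITIVITY.get(k) == "pii" else v
        for k, v in record.items()
    }
```

Because tokenization is deterministic, joins and group-bys on the masked column still work, which is why it is often preferred over random masking in shared training environments.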
Module 9: Cost Optimization and Resource Management
- Right-sizing compute clusters for data processing jobs using historical utilization metrics.
- Implementing spot instance strategies for fault-tolerant batch workloads with checkpointing.
- Optimizing data serialization and compression to reduce network egress costs.
- Managing storage costs by enforcing data retention and archival policies automatically.
- Consolidating model inference workloads using multi-model serving endpoints.
- Tracking cost attribution by team, project, or model using cloud tagging and cost allocation keys.
- Designing data compaction jobs to reduce query costs in cloud data warehouses.
- Automating shutdown of non-production environments during off-hours to reduce spend.
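Cost attribution by tag reduces to rolling up billing line items by a tag key, with untagged spend surfaced explicitly rather than silently dropped. A minimal sketch over generic line-item dicts; the field names (`cost`, `tags`) are illustrative stand-ins for a cloud billing export schema:

```python
from collections import defaultdict

def attribute_costs(line_items, key="team"):
    """Roll up billing line items by a tag key.

    Items missing the tag land in an explicit 'untagged' bucket so the
    gap in tagging hygiene is visible and can be chased down.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)
```

The same roll-up run with `key="project"` or `key="model"` gives the other attribution views named above, provided the tagging policy enforces those keys at resource creation.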