This curriculum spans the technical breadth of a multi-workshop program on enterprise data modeling, matching the depth of architectural decision-making and implementation trade-offs encountered in advisory engagements for large-scale data platforms.
Module 1: Foundations of Big Data Modeling and System Architecture
- Selecting between batch and streaming data models based on SLA requirements and downstream consumption patterns
- Defining data partitioning strategies in distributed file systems to balance query performance and storage overhead
- Choosing appropriate cluster managers (YARN, Kubernetes) based on workload isolation and resource scheduling needs
- Implementing schema-on-read versus schema-on-write based on ingestion velocity and data consumer flexibility
- Designing data lake zones (raw, curated, trusted) to support auditability and incremental transformation
- Integrating metadata management early in the architecture to enable lineage tracking and schema evolution
- Configuring replication factors in HDFS or object storage to meet fault tolerance and read performance targets
- Aligning data modeling choices with underlying compute engine capabilities (e.g., Spark SQL, Presto, Flink)
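The zone layout and date partitioning above can be sketched as a small path-builder. This is a minimal illustration, assuming a Hive-style `source=/dt=` key convention; the `partition_path` name and the three-zone tuple are illustrative, not a prescribed standard:

```python
from datetime import date

# Illustrative zone layout for a data lake: raw -> curated -> trusted.
ZONES = ("raw", "curated", "trusted")

def partition_path(zone: str, source: str, dt: date) -> str:
    """Build a Hive-style partition path, e.g. raw/source=orders/dt=2024-01-15.

    Partitioning by ingestion date keeps incremental loads append-only
    and lets query engines prune files by the dt= prefix.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/source={source}/dt={dt.isoformat()}"

print(partition_path("raw", "orders", date(2024, 1, 15)))
# raw/source=orders/dt=2024-01-15
```

Keeping the zone name as the top-level path segment makes retention and access policies per zone trivial to enforce at the storage layer.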
Module 2: Data Modeling Patterns for Scalable Storage
- Applying star schema modeling in data warehouses while managing denormalization trade-offs for query speed
- Implementing slowly changing dimensions (Type 2) with efficient delta detection and history retention policies
- Designing nested JSON or Parquet structures for semi-structured data while preserving predicate pushdown efficiency
- Choosing between wide-column stores and document models based on access patterns and update frequency
- Optimizing row group and page sizes in columnar formats for scan-heavy versus point-query workloads
- Modeling time-series data with time-based partitioning and compaction strategies to manage storage growth
- Structuring graph data models in property graphs versus RDF triples based on query engine support
- Using surrogate keys in distributed environments to avoid key collisions across data sources
Module 3: Schema Design and Evolution in Distributed Systems
- Enforcing schema validation at ingestion using Avro or Protobuf with schema registry integration
- Managing backward and forward compatibility during schema evolution in streaming pipelines
- Handling schema drift from source systems by implementing automated alerting and fallback mechanisms
- Versioning schemas in metadata repositories to support point-in-time data reconstruction
- Designing union types and optional fields to accommodate heterogeneous data without breaking pipelines
- Coordinating schema changes across multiple consuming services to prevent processing failures
- Using schema inference cautiously in production pipelines due to performance and consistency risks
- Implementing schema migration strategies for existing datasets during format or structure changes
Module 4: Data Partitioning and Distribution Strategies
- Selecting partition keys that avoid data skew while supporting common query filters
- Implementing dynamic partition pruning in query engines to reduce I/O in large tables
- Managing partition explosion by limiting cardinality and using bucketing for high-cardinality dimensions
- Choosing between range, hash, and list partitioning based on query patterns and data distribution
- Repartitioning data during ETL to align with downstream join and aggregation performance needs
- Handling time-based partitioning across multiple time zones with consistent UTC alignment
- Designing composite partitioning strategies for multi-dimensional access patterns
- Monitoring partition size distribution to prevent small file problems and optimize compaction
Module 5: Performance Optimization in Query and Storage Layers
- Configuring file formats (Parquet, ORC, Delta Lake) with optimal compression and encoding settings
- Implementing Z-order indexing or data skipping in Delta Lake to accelerate multi-column queries
- Pre-aggregating metrics in materialized views while managing update latency and storage cost
- Optimizing join strategies (broadcast, shuffle, sort-merge) based on dataset size and cluster resources
- Using predicate pushdown effectively by aligning filters with partition and sort keys
- Tuning Spark executor memory and parallelism to match data volume and cluster topology
- Implementing caching strategies for frequently accessed datasets in memory or SSD layers
- Monitoring query execution plans to identify bottlenecks in data shuffling and I/O
Module 6: Data Governance and Metadata Management
- Implementing column-level lineage tracking to support impact analysis and compliance audits
- Classifying sensitive data fields and enforcing masking policies at query runtime
- Integrating data catalogs (e.g., Apache Atlas, DataHub) with ETL pipelines for automated metadata capture
- Defining data ownership and stewardship roles for datasets across business units
- Enforcing data quality rules at ingestion with failure thresholds and quarantine zones
- Managing retention policies for different data tiers based on regulatory and business requirements
- Documenting data definitions and business context in a centralized glossary linked to technical metadata
- Automating metadata validation to detect undocumented or orphaned datasets
Module 7: Real-Time Data Modeling and Streaming Architectures
- Designing event schemas with immutable facts and explicit event time for temporal consistency
- Choosing between Kafka Streams, Flink, or Spark Structured Streaming based on state management needs
- Modeling changelog streams for CDC data with tombstone handling and compaction policies
- Implementing watermarking strategies to balance latency and completeness in windowed aggregations
- Structuring stream-table joins to handle late-arriving data and state expiration
- Defining key semantics for stateful operations to prevent skew in keyed stream processing
- Modeling session windows for user behavior analysis with configurable inactivity gaps
- Integrating streaming data into batch systems using micro-batch ingestion with consistency checks
Module 8: Scalable Data Integration and Pipeline Design
- Designing idempotent ingestion processes to handle retries without data duplication
- Implementing change data capture from RDBMS sources using log-based tools like Debezium
- Orchestrating complex dependencies in data pipelines using Airflow or Dagster with failure recovery
- Validating data consistency across batch and streaming pipelines using reconciliation jobs
- Handling schema divergence between source systems and target models through transformation layers
- Monitoring pipeline latency and throughput to detect degradation before SLA breaches
- Securing data in transit and at rest using encryption and access control integration
- Scaling ingestion pipelines horizontally to handle peak load from high-volume sources
Module 9: Advanced Topics in Data Modeling and AI Readiness
- Preparing feature stores with consistent time alignment for offline and online model training
- Designing entity resolution models to unify customer data across disparate sources
- Structuring labeled datasets for supervised learning with versioned ground truth
- Implementing data versioning using DVC or Delta Lake Time Travel for reproducible experiments
- Modeling time-series features with lagged variables and rolling aggregations for forecasting
- Ensuring feature consistency between training and serving environments to prevent skew
- Partitioning training data to avoid leakage across time or entities in model evaluation
- Generating synthetic data to augment rare events while preserving statistical properties
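The leakage-safe partitioning of training data above can be sketched as a temporal split. A minimal sketch assuming rows carry an `event_time` field (the cutoff-based rule is the simplest of several leakage-safe schemes; entity-level splits follow the same idea keyed on IDs):

```python
def time_split(rows, cutoff):
    """Sketch of a leakage-safe temporal split: everything with
    event_time strictly before `cutoff` trains, the rest evaluates.

    A random row-level split can leak future information about an
    entity into training; a single time cut cannot.
    """
    train = [r for r in rows if r["event_time"] < cutoff]
    test = [r for r in rows if r["event_time"] >= cutoff]
    return train, test

rows = [{"event_time": t, "y": t % 2} for t in range(10)]
train, test = time_split(rows, cutoff=7)
assert max(r["event_time"] for r in train) < min(r["event_time"] for r in test)
```

For features built with lagged variables or rolling aggregations, the same cutoff must also bound the feature-computation window, otherwise the features themselves smuggle post-cutoff data into training.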