This curriculum covers the technical and organizational complexity of building enterprise-grade data platforms across a multi-workshop program, addressing the design decisions and trade-offs encountered in large-scale data engineering engagements, from ingestion and transformation through storage and governance.
Module 1: Defining Big Data Process Requirements and Scope
- Selecting data ingestion sources based on SLA commitments, data freshness needs, and downstream system dependencies
- Negotiating data ownership and access rights with legal and compliance teams for cross-departmental datasets
- Determining batch vs. real-time processing requirements based on business use case latency thresholds
- Mapping data lineage requirements at project initiation to meet future audit and regulatory obligations
- Establishing data volume thresholds that trigger architectural changes (e.g., from SQL to NoSQL or streaming)
- Documenting exception handling expectations for missing, malformed, or delayed data inputs
- Aligning process scope with existing enterprise data governance frameworks and metadata standards
- Identifying key stakeholders for sign-off on process boundaries and data ownership models
Module 2: Architecting Scalable Data Ingestion Pipelines
- Choosing between pull-based and push-based ingestion models based on source system capabilities and network constraints
- Implementing retry logic and backpressure mechanisms in Kafka consumers to handle downstream outages (retry and dead-letter handling are sketched after this module's list)
- Designing schema evolution strategies for Avro or Protobuf in message queues to support backward compatibility
- Configuring secure authentication (e.g., OAuth, mTLS) between ingestion tools and source systems
- Designing partitioning strategies for Kafka topics based on throughput, ordering requirements, and consumer concurrency
- Implementing dead-letter queues for failed records with monitoring and alerting on queue depth
- Estimating and provisioning network bandwidth for high-volume ingestion from IoT or log sources
- Validating data integrity at ingestion using checksums or hash comparisons
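The retry and dead-letter items above can be illustrated with a minimal consumer-loop sketch using the confluent-kafka Python client. The broker address, the orders.raw and orders.dlq topic names, and the process() sink are assumptions made for the example, not prescribed tooling.

```python
import json
import time

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-ingest",
    "enable.auto.commit": False,       # commit only after success or DLQ hand-off
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["orders.raw"])

MAX_RETRIES = 3

def process(record: dict) -> None:
    """Placeholder for the real sink write (warehouse, object store, etc.)."""

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue                        # nothing fetched, or a transport-level error to log
    record = json.loads(msg.value())
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(record)
            break
        except Exception:
            if attempt == MAX_RETRIES:
                # Route the poison record to the dead-letter topic for later inspection
                producer.produce("orders.dlq", msg.value(), headers=msg.headers())
                producer.flush()
            else:
                time.sleep(2 ** attempt)  # exponential backoff before retrying
    consumer.commit(msg)
```

Committing the offset only after a record has either been processed or handed off to the dead-letter topic keeps a downstream outage from silently dropping data; alerting on the depth of the DLQ topic then covers the monitoring item above.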
Module 3: Data Transformation and Workflow Orchestration
- Selecting orchestration tools (e.g., Airflow, Prefect, Dagster) based on scheduling complexity and observability needs
- Defining idempotent transformation logic to support safe pipeline retries without data duplication (see the DAG sketch after this list)
- Implementing incremental data processing using watermarking and change data capture (CDC) techniques
- Managing dependency chains across heterogeneous environments (Spark, Python, SQL) in a single DAG
- Configuring retry policies and timeout thresholds for long-running transformation jobs
- Version-controlling ETL code and configuration using Git with branching strategies for testing and production
- Embedding data quality checks (e.g., null rate, value distribution) within transformation workflows
- Optimizing shuffle operations in Spark by tuning partition counts and broadcast joins
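A minimal Airflow DAG sketch of the retry-policy and idempotency items above, assuming Airflow 2.x; the DAG id, schedule, and the partition-overwrite transform are illustrative assumptions rather than prescribed tooling.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def transform_partition(data_interval_start=None, data_interval_end=None, **_):
    # Re-running this task rebuilds exactly one interval's partition, so retries and
    # backfills never duplicate rows (idempotent by construction).
    print(f"rebuilding partition {data_interval_start:%Y-%m-%d} .. {data_interval_end:%Y-%m-%d}")

with DAG(
    dag_id="orders_daily_transform",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,                               # retry policy for transient failures
        "retry_delay": timedelta(minutes=10),
        "execution_timeout": timedelta(hours=2),    # timeout threshold for long-running jobs
    },
) as dag:
    PythonOperator(task_id="transform_orders", python_callable=transform_partition)
```

Scoping each run to its data interval, rather than to "everything up to now", is what makes the retry policy safe to use.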
Module 4: Storage Architecture for Structured and Unstructured Data
- Selecting file formats (Parquet, ORC, JSON) based on query patterns, compression needs, and schema evolution
- Designing partitioning and bucketing strategies in data lakes to reduce scan costs and improve query performance (a partitioned-write sketch follows this list)
- Implementing lifecycle policies for object storage (e.g., S3 Glacier transitions) to manage cost and compliance
- Choosing between data lake and data warehouse based on query concurrency, ACID requirements, and user access patterns
- Configuring access controls at the object and column level using IAM roles and Apache Ranger policies
- Planning for metadata management using centralized catalogs (e.g., AWS Glue, Unity Catalog)
- Designing schema registry integration for enforcing consistency across streaming and batch pipelines
- Replicating data across regions for disaster recovery while managing egress costs and latency
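A minimal PySpark sketch of a partitioned, columnar write to object storage, illustrating the file-format and partitioning items above; the bucket paths and column names are assumptions made for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-writer").getOrCreate()

events = spark.read.json("s3a://example-raw/events/")    # hypothetical source path

(
    events
    .repartition("event_date")                 # one file group per partition value
    .write
    .mode("overwrite")
    .partitionBy("event_date", "region")       # directory layout that enables partition pruning
    .parquet("s3a://example-lake/events/")     # columnar format: good compression, column pruning
)
```

Partitioning on the columns that appear most often in query filters is what turns the directory layout into a scan-cost reduction.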
Module 5: Real-Time Stream Processing Design
- Selecting stream processing engines (Flink, Spark Streaming, Kafka Streams) based on the need for exactly-once semantics
- Designing windowing strategies (tumbling, sliding, session) to match business event aggregation logic
- Managing state storage size and checkpointing frequency to balance recovery time and performance
- Handling out-of-order events using watermarks and late-arrival buffers in time-based aggregations (see the streaming sketch after this list)
- Integrating stream joins with dimension data stored in external databases or state stores
- Scaling consumer groups dynamically based on lag metrics and throughput requirements
- Implementing end-to-end latency monitoring with distributed tracing across microservices
- Securing stream data in transit and at rest using encryption and access policies
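A minimal Structured Streaming sketch of a tumbling-window aggregation with a watermark for late-arriving events; it assumes the Spark Kafka connector is on the classpath, and the topic name, schema, and checkpoint path are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-agg").getOrCreate()

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clicks")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "user_id STRING, event_time TIMESTAMP").alias("e"))
    .select("e.*")
)

counts = (
    clicks
    .withWatermark("event_time", "10 minutes")                  # tolerate events up to 10 min late
    .groupBy(F.window("event_time", "5 minutes"), "user_id")    # tumbling 5-minute windows
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/clicks")    # state recovery on restart
    .start()
)
query.awaitTermination()
```

The watermark bounds how long window state is retained, which is the same trade-off the checkpointing and state-size item above describes: a longer watermark tolerates later events at the cost of more state.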
Module 6: Data Quality and Validation Engineering
- Embedding data validation rules (e.g., uniqueness, referential integrity) at ingestion and transformation stages (a validation sketch follows this list)
- Configuring automated alerting on data quality rule violations using tools like Great Expectations or Deequ
- Establishing data reconciliation processes between source and target systems for critical pipelines
- Defining acceptable data drift thresholds for statistical profiles and triggering retraining workflows
- Implementing data profiling jobs to detect schema changes or unexpected value distributions
- Managing false positive rates in data quality alerts to maintain operational trust
- Documenting data quality SLAs and escalation paths for unresolved issues
- Versioning data validation rules to support auditability and rollback capabilities
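The items above assume tools such as Great Expectations or Deequ; as a library-free illustration of the same kinds of checks (uniqueness, null rate, value distribution), the following pandas sketch uses hypothetical column names and thresholds.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    failures = []
    # Uniqueness: the primary key must not repeat
    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")
    # Null rate: tolerate at most 1% missing customer IDs
    if df["customer_id"].isna().mean() > 0.01:
        failures.append("customer_id null rate above 1%")
    # Value distribution: amounts should be non-negative and below a sanity cap
    if not df["amount"].between(0, 1_000_000).all():
        failures.append("amount outside expected range")
    return failures

batch = pd.read_parquet("orders.parquet")      # hypothetical input extract
issues = validate(batch)
if issues:
    # In a pipeline this would alert or quarantine the batch, per the alerting item above
    raise ValueError("; ".join(issues))
```

In a production deployment these rules would live in a versioned rule set, which is what makes the auditability and rollback item above workable.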
Module 7: Metadata Management and Data Lineage
- Integrating automated lineage capture tools (e.g., Marquez, DataHub) with orchestration and ETL platforms (a lineage-event sketch follows this list)
- Mapping technical lineage (table-to-table) and business lineage (KPI-to-source) for regulatory reporting
- Standardizing metadata tagging for data domains, sensitivity levels, and stewardship ownership
- Implementing search and discovery features over metadata to reduce time-to-insight for analysts
- Handling lineage gaps in legacy systems that lack instrumentation or logging capabilities
- Synchronizing metadata across environments (dev, staging, prod) using CI/CD pipelines
- Defining retention policies for operational metadata (e.g., job execution logs, query history)
- Exposing lineage data to data governance teams via API or reporting dashboards
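A rough sketch of emitting a run-level lineage event in the OpenLineage event shape that collectors such as Marquez can ingest over HTTP; the endpoint URL, namespaces, and job and dataset names are assumptions, and real pipelines would normally rely on the openlineage-python client or a built-in orchestrator integration instead.

```python
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "warehouse", "name": "orders_daily_transform"},
    "inputs": [{"namespace": "s3://example-raw", "name": "events"}],
    "outputs": [{"namespace": "s3://example-lake", "name": "orders_daily"}],
    "producer": "https://example.com/etl/1.0",     # identifies the emitting integration
}

# Hypothetical internal collector endpoint; Marquez-style services accept lineage events via POST
requests.post("http://lineage.internal:5000/api/v1/lineage", json=event, timeout=10)
```

Emitting one event per job run is what lets table-to-table lineage be assembled automatically rather than documented by hand.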
Module 8: Performance Tuning and Cost Optimization
- Right-sizing cluster resources (CPU, memory, disk) based on historical job profiling and peak loads
- Implementing autoscaling policies for cloud data platforms (e.g., Databricks, BigQuery) with cost caps
- Optimizing query performance through predicate pushdown, column pruning, and indexing strategies (see the pruning sketch after this list)
- Reducing data transfer costs by co-locating compute and storage in the same region
- Monitoring and controlling scan-to-result ratios in analytical queries to prevent wasteful processing
- Using materialized views or pre-aggregated tables to accelerate frequent reporting queries
- Identifying and eliminating orphaned data pipelines that consume resources but serve no active use case
- Conducting regular cost attribution reviews to allocate spending by team, project, or business unit
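A minimal PySpark sketch of the pruning and pushdown items above; the table path, partition column, and query are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("cost-aware-query")
    .config("spark.sql.adaptive.enabled", "true")                    # let AQE right-size shuffles
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

orders = spark.read.parquet("s3a://example-lake/orders/")

# Selecting only the needed columns prunes Parquet column chunks; filtering on the
# partition column (event_date) prunes whole directories before any data is scanned.
daily_revenue = (
    orders
    .select("event_date", "region", "amount")
    .where(F.col("event_date") == "2024-06-01")
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.explain(True)   # check the physical plan for pushed filters and partition pruning
```

Verifying the physical plan is the practical way to monitor the scan-to-result ratio item above: if the filter is not pushed down, the query pays for a full scan regardless of how small the result is.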
Module 9: Governance, Security, and Compliance in Data Processes
- Implementing role-based access control (RBAC) and attribute-based access control (ABAC) for data assets
- Masking or tokenizing PII fields in non-production environments using dynamic data masking rules (a tokenization sketch follows this list)
- Configuring audit logging for data access and modification events across storage and compute layers
- Enforcing data retention and deletion policies in alignment with GDPR, CCPA, or industry mandates
- Conducting data protection impact assessments (DPIAs) for high-risk processing activities
- Integrating data classification tools to automatically tag sensitive data at rest and in motion
- Managing encryption key rotation and access for data-at-rest using cloud KMS or on-prem HSMs
- Coordinating data breach response procedures with incident management teams and legal counsel
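A minimal sketch of deterministic tokenization for PII columns in a non-production copy, using an HMAC keyed by a secret that would come from a KMS or secret manager in practice; the column names and file paths are assumptions made for the example.

```python
import hashlib
import hmac
import os

import pandas as pd

# In production the key comes from a KMS or secret manager, never a default literal
SECRET = os.environ.get("TOKENIZATION_KEY", "dev-only-key").encode()
PII_COLUMNS = ["email", "phone_number"]

def tokenize(value: str) -> str:
    # Deterministic: identical inputs map to identical tokens, so joins across masked
    # tables still line up, but the raw value cannot be recovered without the key.
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

customers = pd.read_parquet("customers.parquet")            # hypothetical production extract
for col in PII_COLUMNS:
    customers[col] = customers[col].astype(str).map(tokenize)
customers.to_parquet("customers_masked.parquet")            # safe to load into non-production
```

Deterministic tokens preserve referential integrity for testing while keeping raw PII out of lower environments, which is the trade-off this item asks participants to weigh against full dynamic masking.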