This curriculum spans the technical and organizational complexity of a multi-workshop program focused on building and operating data platforms in large enterprises, covering the full lifecycle from pipeline architecture and governance to cost optimization and cross-team coordination.
Module 1: Defining Data Pipeline Architecture at Scale
- Selecting between batch, micro-batch, and streaming ingestion based on SLA requirements and source system capabilities
- Designing idempotent processing stages to handle duplicate message delivery in distributed queues
- Choosing serialization formats (Avro, Parquet, Protobuf) based on schema evolution needs and downstream consumption patterns
- Implementing schema registry integration to enforce backward and forward compatibility in evolving data contracts
- Partitioning strategies for large fact tables to balance query performance and storage efficiency
- Configuring retry mechanisms with exponential backoff for transient failures in cross-system communication
- Establishing data lineage tracking at the field level for auditability and impact analysis
- Defining ownership and stewardship roles for pipeline components across data engineering and domain teams
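The idempotent-stage pattern in this module can be sketched as a content-hash dedupe wrapper. This is a minimal illustration, not a production design: the `IdempotentStage` name is invented, and the in-memory seen-set would be a durable key store (or keyed state backend) in a real pipeline.

```python
import hashlib
import json


class IdempotentStage:
    """Wraps a handler so reprocessing a duplicate message is a no-op.

    Dedupes on a content hash of the message; the in-memory seen-set is
    for the sketch only -- production would persist processed keys.
    """

    def __init__(self, handler):
        self.handler = handler
        self._seen = set()

    def _key(self, message: dict) -> str:
        # Canonical JSON so field ordering does not change the hash.
        canonical = json.dumps(message, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def process(self, message: dict) -> bool:
        """Return True if handled, False if skipped as a duplicate."""
        key = self._key(message)
        if key in self._seen:
            return False
        self.handler(message)
        self._seen.add(key)  # mark done only after the handler succeeds
        return True


# A queue redelivering the first message at-least-once (illustrative data):
delivered = [
    {"order_id": 1, "amount": 10},
    {"order_id": 1, "amount": 10},  # duplicate delivery
    {"order_id": 2, "amount": 5},
]
sink = []
stage = IdempotentStage(sink.append)
handled = [stage.process(m) for m in delivered]
```

Marking the key as seen only after the handler returns keeps at-least-once semantics: a crash mid-handler means the message is retried, not lost.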
Module 2: Orchestration Framework Selection and Configuration
- Evaluating Airflow, Prefect, or Dagster based on team expertise, observability needs, and dynamic DAG generation requirements
- Designing DAG structure to minimize cross-dependencies while maintaining business process integrity
- Implementing dynamic task generation for parameterized workflows processing multiple tenants or regions
- Setting up alerting thresholds for task duration, backfill volume, and sensor timeouts
- Securing connections and variables using role-based access and secrets backend integration
- Managing deployment workflows for DAG updates using CI/CD with rollback capability
- Handling backfill operations without overloading downstream systems or storage tiers
- Integrating custom operators for legacy systems lacking native connector support
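The dynamic task generation and fan-in structure above can be pictured independently of any one orchestrator: a per-tenant extract → validate → load chain built in a loop, ordered with the standard library's topological sorter. Tenant and task names here are made up for illustration; in Airflow, Prefect, or Dagster the same loop would drive dynamic task or asset creation.

```python
from graphlib import TopologicalSorter

# Hypothetical tenants driving dynamic task generation.
TENANTS = ["emea", "apac"]

dag: dict[str, set[str]] = {}
for t in TENANTS:
    dag[f"extract_{t}"] = set()
    dag[f"validate_{t}"] = {f"extract_{t}"}
    dag[f"load_{t}"] = {f"validate_{t}"}

# A single fan-in node keeps cross-tenant dependencies to one edge each,
# rather than wiring every downstream task to every tenant chain.
dag["publish_report"] = {f"load_{t}" for t in TENANTS}

execution_order = list(TopologicalSorter(dag).static_order())
```

Minimizing cross-dependencies this way also bounds blast radius: a failure in one tenant chain blocks only the shared fan-in, not sibling tenants.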
Module 3: Data Quality and Validation Engineering
- Embedding Great Expectations or custom validators into pipeline stages for early failure detection
- Defining acceptable null rates and distribution bounds for critical business metrics
- Implementing quarantine zones for records failing validation without blocking pipeline progression
- Automating schema drift detection and routing alerts to data stewards for review
- Configuring sampling strategies for validation on multi-terabyte datasets
- Establishing escalation paths for recurring data quality incidents tied to upstream systems
- Versioning data expectations to support A/B testing and gradual rollout of new rules
- Measuring validation overhead and optimizing execution to avoid pipeline bottlenecks
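The quarantine-zone idea above can be sketched as a router that diverts failing records without halting the batch, escalating only when the failure rate crosses a bound. Function name, field names, and the 20% default are all illustrative, not from the source.

```python
def route_with_quarantine(records, required_fields, max_null_rate=0.2):
    """Validate records, diverting failures to a quarantine list so the
    pipeline keeps moving; raise only when the whole batch is suspect."""
    passed, quarantined = [], []
    for rec in records:
        missing = [f for f in required_fields if rec.get(f) is None]
        (quarantined if missing else passed).append(rec)
    total = len(records)
    failure_rate = len(quarantined) / total if total else 0.0
    if failure_rate > max_null_rate:
        # Batch-level escalation path: too many failures to quarantine.
        raise ValueError(f"failure rate {failure_rate:.0%} exceeds bound")
    return passed, quarantined


# Illustrative batch: nine clean records, one with a null metric.
batch = [{"id": i, "amount": i * 10} for i in range(9)]
batch.append({"id": 9, "amount": None})
passed, quarantined = route_with_quarantine(batch, ["id", "amount"])
```

A tool like Great Expectations replaces the hand-rolled checks here, but the routing decision — quarantine individual records, fail the batch only past a threshold — stays with the pipeline.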
Module 4: Metadata Management and Discovery
- Integrating automated metadata extraction from ETL jobs into a centralized catalog (e.g., DataHub, Atlas)
- Mapping technical column names to business glossary terms with ownership attribution
- Scheduling freshness checks and surfacing stale datasets in discovery interfaces
- Implementing access-controlled metadata views based on user roles and data classification
- Tracking dataset dependencies to assess impact of deprecation or schema changes
- Enabling user feedback mechanisms (e.g., ratings, annotations) on dataset usability
- Harvesting usage statistics from query logs to prioritize curation efforts
- Automating PII detection and tagging to support compliance workflows
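Automated PII tagging can be sketched as a regex scan over sampled rows that attaches labels to columns for the catalog to pick up. The two patterns below are deliberately simplistic stand-ins; real compliance tooling uses much broader detectors and confidence scoring.

```python
import re

# Illustrative detectors only -- far from production-grade coverage.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\-\s]{7,}\d"),
}


def tag_pii_columns(sample_rows):
    """Scan a row sample and tag columns whose values look like PII."""
    tags: dict[str, set[str]] = {}
    for row in sample_rows:
        for col, val in row.items():
            if not isinstance(val, str):
                continue
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(val):
                    tags.setdefault(col, set()).add(label)
    return tags


sample = [
    {"user": "u1", "contact": "alice@example.com", "notes": "vip"},
    {"user": "u2", "contact": "bob@example.org", "notes": "+44 20 7946 0958"},
]
pii_tags = tag_pii_columns(sample)
```

Tagging at the column level (rather than per value) is what lets downstream access policies and classification labels key off the catalog entry.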
Module 5: Scalable Storage and Data Lakehouse Design
- Choosing between Delta Lake, Iceberg, and Hudi based on transactional requirements and compute engine compatibility
- Implementing time-travel and version rollback capabilities for debugging and recovery
- Configuring partitioning and clustering keys to optimize query performance on large tables
- Setting up lifecycle policies to transition cold data to lower-cost storage tiers
- Enforcing file size targets to prevent small-file problems in object storage
- Designing folder structures that support efficient partition pruning and access control
- Managing concurrent writes using optimistic concurrency control or locking mechanisms
- Validating data integrity after compaction and optimization jobs
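The small-file mitigation above can be sketched as planning compaction batches toward a target output size with a greedy largest-first grouping. Sizes and the 100 MB target are invented; table formats like Delta Lake and Iceberg ship their own OPTIMIZE/rewrite jobs that do this natively.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files into compaction batches near a target output
    size. Largest-first keeps batches close to the target; a file already
    over the target simply forms its own group."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Illustrative sizes in MB against a 100 MB target.
batches = plan_compaction([60, 50, 40, 30, 20], target_bytes=100)
```

Enforcing a file-size floor this way trades a little write amplification for far fewer object-store listings and open-file operations at query time.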
Module 6: Monitoring, Alerting, and Incident Response
- Instrumenting pipelines with structured logging and distributed tracing for root cause analysis
- Defining SLOs for pipeline completion time and data freshness with error budget policies
- Creating alert suppression rules to avoid noise during scheduled maintenance windows
- Integrating with incident management systems (e.g., PagerDuty, Opsgenie) for on-call rotation
- Setting up dashboards for pipeline health, backlog volume, and resource utilization
- Conducting blameless postmortems for major data incidents with action item tracking
- Simulating failure scenarios (e.g., source outage, schema change) in staging environments
- Documenting runbooks for common failure modes and recovery procedures
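The alert-suppression rule above reduces to a time-window check before paging. This is a minimal sketch with an invented window; incident tools like PagerDuty and Opsgenie expose maintenance windows as first-class configuration rather than code.

```python
from datetime import datetime


def should_page(alert_time, maintenance_windows):
    """Suppress paging when the alert falls inside a scheduled
    maintenance window; boundaries are half-open [start, end)."""
    return not any(start <= alert_time < end
                   for start, end in maintenance_windows)


# One illustrative 02:00-04:00 maintenance window.
windows = [(datetime(2024, 6, 1, 2, 0), datetime(2024, 6, 1, 4, 0))]
during = should_page(datetime(2024, 6, 1, 3, 15), windows)
after = should_page(datetime(2024, 6, 1, 5, 0), windows)
```

Suppressing the page, not the alert, is the useful distinction: the event still lands in dashboards and logs for postmortem review, it just does not wake anyone up.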
Module 7: Access Control and Data Governance
- Implementing row-level and column-level security in query engines (e.g., Presto, Spark SQL)
- Mapping business roles to data access policies using attribute-based access control (ABAC)
- Integrating with enterprise identity providers (e.g., Okta, Azure AD) for SSO and provisioning
- Automating access revocation upon employee offboarding or role change
- Auditing data access patterns to detect anomalous or unauthorized queries
- Enforcing data classification labels during ingestion and propagating them through transformations
- Managing consent flags for personal data in multi-region deployments
- Coordinating data retention schedules with legal and compliance teams
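Mapping business roles to access policies via ABAC can be sketched as attribute matching: a policy grants an action when every user and resource condition holds. Policy shape, attribute names, and roles below are invented for illustration; real enforcement sits in the query engine or a policy engine, not application code.

```python
def abac_allows(user, resource, action, policies):
    """Minimal ABAC evaluator: a policy grants access when its action
    matches and every listed attribute condition is satisfied."""
    for policy in policies:
        if policy["action"] != action:
            continue
        user_ok = all(user.get(k) == v for k, v in policy["user"].items())
        res_ok = all(resource.get(k) == v
                     for k, v in policy["resource"].items())
        if user_ok and res_ok:
            return True
    return False  # default-deny


policies = [
    {"action": "read",
     "user": {"department": "finance"},
     "resource": {"domain": "finance", "classification": "internal"}},
]
analyst = {"department": "finance", "role": "analyst"}
internal_ds = {"domain": "finance", "classification": "internal"}
restricted_ds = {"domain": "finance", "classification": "restricted"}
```

Keying policies on attributes such as `classification` is what lets the labels enforced at ingestion drive access decisions without per-dataset rules.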
Module 8: Cost Management and Resource Optimization
- Right-sizing cluster configurations based on historical workload patterns and peak demand
- Implementing auto-scaling policies for streaming and batch processing frameworks
- Tracking compute and storage costs by team, project, or business unit using tagging
- Optimizing file formats and compression to reduce storage footprint and I/O costs
- Negotiating reserved instance or savings plan commitments for stable workloads
- Identifying and eliminating orphaned datasets and unused pipelines
- Using spot instances for fault-tolerant, non-critical processing jobs
- Conducting regular cost reviews with stakeholders to align spending with business value
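Tag-based cost attribution reduces to a rollup of billing line items by a resource tag, with untagged spend bucketed separately so it can be chased down. The line-item shape and the `team` tag key are illustrative; cloud billing exports carry far more fields.

```python
def costs_by_tag(line_items, tag_key):
    """Roll up billing line items by a resource tag; spend without the
    tag lands in an explicit 'untagged' bucket for follow-up."""
    totals = {}
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] = round(totals.get(owner, 0.0) + item["cost"], 2)
    return totals


# Illustrative billing export rows.
billing = [
    {"resource": "cluster-a", "cost": 120.50, "tags": {"team": "analytics"}},
    {"resource": "bucket-b", "cost": 40.25, "tags": {"team": "ml"}},
    {"resource": "vm-c", "cost": 10.00, "tags": {}},
    {"resource": "cluster-d", "cost": 79.50, "tags": {"team": "analytics"}},
]
spend = costs_by_tag(billing, "team")
```

Surfacing the untagged bucket explicitly, rather than dropping it, is what turns the rollup into an enforcement tool: the bucket should trend toward zero as tagging policy takes hold.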
Module 9: Change Management and Cross-Team Collaboration
- Establishing change advisory boards for reviewing high-impact data model modifications
- Documenting API contracts and deprecation timelines for shared datasets
- Coordinating schema evolution with consumer teams using versioned endpoints
- Facilitating data domain alignment sessions to resolve ownership disputes
- Creating sandbox environments for testing breaking changes without affecting production
- Standardizing naming conventions and metadata practices across business units
- Managing technical debt in pipelines through scheduled refactoring windows
- Integrating data change notifications into team communication platforms (e.g., Slack, Teams)
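Before a change advisory board signs off on a shared-dataset modification, an automated gate can catch the obvious breakages. This sketch checks backward compatibility field by field under a simple model (field name plus type string, both invented here); a schema registry such as Confluent's performs much richer checks across version ranges.

```python
def is_backward_compatible(old_schema, new_schema):
    """A change is backward compatible for existing consumers when every
    old field survives with the same type: additions are fine, removals
    and type changes are breaking."""
    return all(
        field in new_schema and new_schema[field] == ftype
        for field, ftype in old_schema.items()
    )


# Illustrative schema versions for a shared orders dataset.
v1 = {"order_id": "long", "amount": "decimal"}
v2_added = {"order_id": "long", "amount": "decimal", "currency": "string"}
v2_breaking = {"order_id": "string", "amount": "decimal"}
```

Wiring a check like this into CI for shared datasets turns the deprecation-timeline policy into an enforced gate: breaking changes must ship as a new versioned endpoint instead of mutating the contract in place.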