This curriculum spans the technical and organizational complexity of a multi-workshop program focused on building and operating data platforms in large enterprises, covering the full lifecycle from pipeline architecture and governance to cost optimization and cross-team coordination.
Module 1: Defining Data Pipeline Architecture at Scale
- Selecting between batch, micro-batch, and streaming ingestion based on SLA requirements and source system capabilities
- Designing idempotent processing stages to handle duplicate message delivery in distributed queues
- Choosing serialization formats (Avro, Parquet, Protobuf) based on schema evolution needs and downstream consumption patterns
- Implementing schema registry integration to enforce backward and forward compatibility in evolving data contracts
- Partitioning strategies for large fact tables to balance query performance and storage efficiency
- Configuring retry mechanisms with exponential backoff for transient failures in cross-system communication
- Establishing data lineage tracking at the field level for auditability and impact analysis
- Defining ownership and stewardship roles for pipeline components across data engineering and domain teams
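The idempotent-stage pattern in this module can be sketched as a content-hash dedupe wrapper. This is a minimal illustration, not a production design: the `IdempotentStage` name is invented, and the in-memory seen-set would be a durable key store (or keyed state backend) in a real pipeline.

```python
import hashlib
import json


class IdempotentStage:
    """Wraps a handler so reprocessing a duplicate message is a no-op.

    Dedupes on a content hash of the message; the in-memory seen-set is
    for the sketch only -- production would persist processed keys.
    """

    def __init__(self, handler):
        self.handler = handler
        self._seen = set()

    def _key(self, message: dict) -> str:
        # Canonical JSON so field ordering does not change the hash.
        canonical = json.dumps(message, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def process(self, message: dict) -> bool:
        """Return True if handled, False if skipped as a duplicate."""
        key = self._key(message)
        if key in self._seen:
            return False
        self.handler(message)
        self._seen.add(key)  # mark done only after the handler succeeds
        return True


# A queue redelivering the first message at-least-once (illustrative data):
delivered = [
    {"order_id": 1, "amount": 10},
    {"order_id": 1, "amount": 10},  # duplicate delivery
    {"order_id": 2, "amount": 5},
]
sink = []
stage = IdempotentStage(sink.append)
handled = [stage.process(m) for m in delivered]
```

Marking the key as seen only after the handler returns keeps at-least-once semantics: a crash mid-handler means the message is retried, not lost.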
Module 2: Orchestration Framework Selection and Configuration
- Evaluating Airflow, Prefect, or Dagster based on team expertise, observability needs, and dynamic DAG generation requirements
- Designing DAG structure to minimize cross-dependencies while maintaining business process integrity
- Implementing dynamic task generation for parameterized workflows processing multiple tenants or regions
- Setting up alerting thresholds for task duration, backfill volume, and sensor timeouts
- Securing connections and variables using role-based access and secrets backend integration
- Managing deployment workflows for DAG updates using CI/CD with rollback capability
- Handling backfill operations without overloading downstream systems or storage tiers
- Integrating custom operators for legacy systems lacking native connector support
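The dynamic task generation and fan-in structure above can be pictured independently of any one orchestrator: a per-tenant extract → validate → load chain built in a loop, ordered with the standard library's topological sorter. Tenant and task names here are made up for illustration; in Airflow, Prefect, or Dagster the same loop would drive dynamic task or asset creation.

```python
from graphlib import TopologicalSorter

# Hypothetical tenants driving dynamic task generation.
TENANTS = ["emea", "apac"]

dag: dict[str, set[str]] = {}
for t in TENANTS:
    dag[f"extract_{t}"] = set()
    dag[f"validate_{t}"] = {f"extract_{t}"}
    dag[f"load_{t}"] = {f"validate_{t}"}

# A single fan-in node keeps cross-tenant dependencies to one edge each,
# rather than wiring every downstream task to every tenant chain.
dag["publish_report"] = {f"load_{t}" for t in TENANTS}

execution_order = list(TopologicalSorter(dag).static_order())
```

Minimizing cross-dependencies this way also bounds blast radius: a failure in one tenant chain blocks only the shared fan-in, not sibling tenants.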
Module 3: Data Quality and Validation Engineering
- Embedding Great Expectations or custom validators into pipeline stages for early failure detection
- Defining acceptable null rates and distribution bounds for critical business metrics
- Implementing quarantine zones for records failing validation without blocking pipeline progression
- Automating schema drift detection and routing alerts to data stewards for review
- Configuring sampling strategies for validation on multi-terabyte datasets
- Establishing escalation paths for recurring data quality incidents tied to upstream systems
- Versioning data expectations to support A/B testing and gradual rollout of new rules
- Measuring validation overhead and optimizing execution to avoid pipeline bottlenecks
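The quarantine-zone idea above can be sketched as a router that diverts failing records without halting the batch, escalating only when the failure rate crosses a bound. Function name, field names, and the 20% default are all illustrative, not from the source.

```python
def route_with_quarantine(records, required_fields, max_null_rate=0.2):
    """Validate records, diverting failures to a quarantine list so the
    pipeline keeps moving; raise only when the whole batch is suspect."""
    passed, quarantined = [], []
    for rec in records:
        missing = [f for f in required_fields if rec.get(f) is None]
        (quarantined if missing else passed).append(rec)
    total = len(records)
    failure_rate = len(quarantined) / total if total else 0.0
    if failure_rate > max_null_rate:
        # Batch-level escalation path: too many failures to quarantine.
        raise ValueError(f"failure rate {failure_rate:.0%} exceeds bound")
    return passed, quarantined


# Illustrative batch: nine clean records, one with a null metric.
batch = [{"id": i, "amount": i * 10} for i in range(9)]
batch.append({"id": 9, "amount": None})
passed, quarantined = route_with_quarantine(batch, ["id", "amount"])
```

A tool like Great Expectations replaces the hand-rolled checks here, but the routing decision — quarantine individual records, fail the batch only past a threshold — stays with the pipeline.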
Module 4: Metadata Management and Discovery
- Integrating automated metadata extraction from ETL jobs into a centralized catalog (e.g., DataHub, Atlas)
- Mapping technical column names to business glossary terms with ownership attribution
- Scheduling freshness checks and surfacing stale datasets in discovery interfaces
- Implementing access-controlled metadata views based on user roles and data classification
- Tracking dataset dependencies to assess impact of deprecation or schema changes
- Enabling user feedback mechanisms (e.g., ratings, annotations) on dataset usability
- Harvesting usage statistics from query logs to prioritize curation efforts
- Automating PII detection and tagging to support compliance workflows
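Automated PII tagging can be sketched as a regex scan over sampled rows that attaches labels to columns for the catalog to pick up. The two patterns below are deliberately simplistic stand-ins; real compliance tooling uses much broader detectors and confidence scoring.

```python
import re

# Illustrative detectors only -- far from production-grade coverage.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\-\s]{7,}\d"),
}


def tag_pii_columns(sample_rows):
    """Scan a row sample and tag columns whose values look like PII."""
    tags: dict[str, set[str]] = {}
    for row in sample_rows:
        for col, val in row.items():
            if not isinstance(val, str):
                continue
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(val):
                    tags.setdefault(col, set()).add(label)
    return tags


sample = [
    {"user": "u1", "contact": "alice@example.com", "notes": "vip"},
    {"user": "u2", "contact": "bob@example.org", "notes": "+44 20 7946 0958"},
]
pii_tags = tag_pii_columns(sample)
```

Tagging at the column level (rather than per value) is what lets downstream access policies and classification labels key off the catalog entry.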
Module 5: Scalable Storage and Data Lakehouse Design
- Choosing between Delta Lake, Iceberg, and Hudi based on transactional requirements and compute engine compatibility
- Implementing time-travel and version rollback capabilities for debugging and recovery
- Configuring partitioning and clustering keys to optimize query performance on large tables
- Setting up lifecycle policies to transition cold data to lower-cost storage tiers
- Enforcing file size targets to prevent small-file problems in object storage
- Designing folder structures that support efficient partition pruning and access control
- Managing concurrent writes using optimistic concurrency control or locking mechanisms
- Validating data integrity after compaction and optimization jobs
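The small-file mitigation above can be sketched as planning compaction batches toward a target output size with a greedy largest-first grouping. Sizes and the 100 MB target are invented; table formats like Delta Lake and Iceberg ship their own OPTIMIZE/rewrite jobs that do this natively.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files into compaction batches near a target output
    size. Largest-first keeps batches close to the target; a file already
    over the target simply forms its own group."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Illustrative sizes in MB against a 100 MB target.
batches = plan_compaction([60, 50, 40, 30, 20], target_bytes=100)
```

Enforcing a file-size floor this way trades a little write amplification for far fewer object-store listings and open-file operations at query time.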
Module 6: Monitoring, Alerting, and Incident Response
- Instrumenting pipelines with structured logging and distributed tracing for root cause analysis
- Defining SLOs for pipeline completion time and data freshness with error budget policies
- Creating alert suppression rules to avoid noise during scheduled maintenance windows
- Integrating with incident management systems (e.g., PagerDuty, Opsgenie) for on-call rotation
- Setting up dashboards for pipeline health, backlog volume, and resource utilization
- Conducting blameless postmortems for major data incidents with action item tracking
- Simulating failure scenarios (e.g., source outage, schema change) in staging environments
- Documenting runbooks for common failure modes and recovery procedures
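The alert-suppression rule above reduces to a time-window check before paging. This is a minimal sketch with an invented window; incident tools like PagerDuty and Opsgenie expose maintenance windows as first-class configuration rather than code.

```python
from datetime import datetime


def should_page(alert_time, maintenance_windows):
    """Suppress paging when the alert falls inside a scheduled
    maintenance window; boundaries are half-open [start, end)."""
    return not any(start <= alert_time < end
                   for start, end in maintenance_windows)


# One illustrative 02:00-04:00 maintenance window.
windows = [(datetime(2024, 6, 1, 2, 0), datetime(2024, 6, 1, 4, 0))]
during = should_page(datetime(2024, 6, 1, 3, 15), windows)
after = should_page(datetime(2024, 6, 1, 5, 0), windows)
```

Suppressing the page, not the alert, is the useful distinction: the event still lands in dashboards and logs for postmortem review, it just does not wake anyone up.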
Module 7: Access Control and Data Governance
- Implementing row-level and column-level security in query engines (e.g., Presto, Spark SQL)
- Mapping business roles to data access policies using attribute-based access control (ABAC)
- Integrating with enterprise identity providers (e.g., Okta, Azure AD) for SSO and provisioning
- Automating access revocation upon employee offboarding or role change
- Auditing data access patterns to detect anomalous or unauthorized queries
- Enforcing data classification labels during ingestion and propagating them through transformations
- Managing consent flags for personal data in multi-region deployments
- Coordinating data retention schedules with legal and compliance teams
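Mapping business roles to access policies via ABAC can be sketched as attribute matching: a policy grants an action when every user and resource condition holds. Policy shape, attribute names, and roles below are invented for illustration; real enforcement sits in the query engine or a policy engine, not application code.

```python
def abac_allows(user, resource, action, policies):
    """Minimal ABAC evaluator: a policy grants access when its action
    matches and every listed attribute condition is satisfied."""
    for policy in policies:
        if policy["action"] != action:
            continue
        user_ok = all(user.get(k) == v for k, v in policy["user"].items())
        res_ok = all(resource.get(k) == v
                     for k, v in policy["resource"].items())
        if user_ok and res_ok:
            return True
    return False  # default-deny


policies = [
    {"action": "read",
     "user": {"department": "finance"},
     "resource": {"domain": "finance", "classification": "internal"}},
]
analyst = {"department": "finance", "role": "analyst"}
internal_ds = {"domain": "finance", "classification": "internal"}
restricted_ds = {"domain": "finance", "classification": "restricted"}
```

Keying policies on attributes such as `classification` is what lets the labels enforced at ingestion drive access decisions without per-dataset rules.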
Module 8: Cost Management and Resource Optimization
- Right-sizing cluster configurations based on historical workload patterns and peak demand
- Implementing auto-scaling policies for streaming and batch processing frameworks
- Tracking compute and storage costs by team, project, or business unit using tagging
- Optimizing file formats and compression to reduce storage footprint and I/O costs
- Negotiating reserved instance or savings plan commitments for stable workloads
- Identifying and eliminating orphaned datasets and unused pipelines
- Using spot instances for fault-tolerant, non-critical processing jobs
- Conducting regular cost reviews with stakeholders to align spending with business value
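Tag-based cost attribution reduces to a rollup of billing line items by a resource tag, with untagged spend bucketed separately so it can be chased down. The line-item shape and the `team` tag key are illustrative; cloud billing exports carry far more fields.

```python
def costs_by_tag(line_items, tag_key):
    """Roll up billing line items by a resource tag; spend without the
    tag lands in an explicit 'untagged' bucket for follow-up."""
    totals = {}
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] = round(totals.get(owner, 0.0) + item["cost"], 2)
    return totals


# Illustrative billing export rows.
billing = [
    {"resource": "cluster-a", "cost": 120.50, "tags": {"team": "analytics"}},
    {"resource": "bucket-b", "cost": 40.25, "tags": {"team": "ml"}},
    {"resource": "vm-c", "cost": 10.00, "tags": {}},
    {"resource": "cluster-d", "cost": 79.50, "tags": {"team": "analytics"}},
]
spend = costs_by_tag(billing, "team")
```

Surfacing the untagged bucket explicitly, rather than dropping it, is what turns the rollup into an enforcement tool: the bucket should trend toward zero as tagging policy takes hold.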
Module 9: Change Management and Cross-Team Collaboration
- Establishing change advisory boards for reviewing high-impact data model modifications
- Documenting API contracts and deprecation timelines for shared datasets
- Coordinating schema evolution with consumer teams using versioned endpoints
- Facilitating data domain alignment sessions to resolve ownership disputes
- Creating sandbox environments for testing breaking changes without affecting production
- Standardizing naming conventions and metadata practices across business units
- Managing technical debt in pipelines through scheduled refactoring windows
- Integrating data change notifications into team communication platforms (e.g., Slack, Teams)
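Before a change advisory board signs off on a shared-dataset modification, an automated gate can catch the obvious breakages. This sketch checks backward compatibility field by field under a simple model (field name plus type string, both invented here); a schema registry such as Confluent's performs much richer checks across version ranges.

```python
def is_backward_compatible(old_schema, new_schema):
    """A change is backward compatible for existing consumers when every
    old field survives with the same type: additions are fine, removals
    and type changes are breaking."""
    return all(
        field in new_schema and new_schema[field] == ftype
        for field, ftype in old_schema.items()
    )


# Illustrative schema versions for a shared orders dataset.
v1 = {"order_id": "long", "amount": "decimal"}
v2_added = {"order_id": "long", "amount": "decimal", "currency": "string"}
v2_breaking = {"order_id": "string", "amount": "decimal"}
```

Wiring a check like this into CI for shared datasets turns the deprecation-timeline policy into an enforced gate: breaking changes must ship as a new versioned endpoint instead of mutating the contract in place.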