This curriculum covers the technical and operational breadth of a multi-workshop program on building and operating enterprise-grade data integration systems, comparable to the internal capability programs that large organisations use to scale their data platforms.
Module 1: Architecting Scalable Data Ingestion Pipelines
- Selecting between batch and streaming ingestion based on SLA requirements and source system capabilities
- Designing idempotent ingestion workflows to handle duplicate messages in distributed systems
- Implementing backpressure mechanisms in Kafka consumers to prevent downstream system overload
- Configuring retry strategies with exponential backoff for transient failures in cloud-based APIs
- Partitioning strategies for large-scale file ingestion from object storage to optimize parallel processing
- Validating data schema upon ingestion using schema registry integration with Avro or Protobuf
- Monitoring end-to-end latency across ingestion stages using distributed tracing tools like OpenTelemetry
- Securing data in transit using mTLS and enforcing authentication between ingestion components
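The retry strategy with exponential backoff described above can be sketched in plain Python. This is an illustrative helper, not a specific library's API; the function name, defaults, and the choice of full jitter are assumptions:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retryable=(ConnectionError, TimeoutError)):
    """Call fn, retrying transient failures with capped exponential backoff.

    Full jitter (a random sleep up to the capped delay) spreads retries out
    so many clients recovering at once do not hammer the API in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the transient error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

In practice the `retryable` tuple should list only errors known to be transient (timeouts, throttling responses); retrying on permanent errors such as authentication failures just delays the inevitable.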
Module 2: Data Source Connectivity and Protocol Management
- Choosing appropriate JDBC fetch sizes and connection pooling parameters for high-volume RDBMS extraction
- Implementing OAuth 2.0 and API key rotation for secure access to SaaS platforms like Salesforce or Workday
- Handling change data capture (CDC) from PostgreSQL logical replication slots without causing WAL bloat
- Configuring gRPC endpoints for efficient communication with microservices exposing real-time data
- Managing API rate limits and quotas when extracting from third-party RESTful services
- Integrating mainframe data via IBM MQ (formerly MQSeries) with proper message segmentation and recovery logic
- Using SSH tunneling for secure access to on-premises databases in hybrid cloud environments
- Normalizing data from heterogeneous sources with inconsistent timestamp formats and encodings
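Client-side handling of API rate limits and quotas, as covered above, is often implemented as a token bucket. The sketch below is a minimal single-threaded version; a production limiter would also need thread safety and awareness of server-sent `Retry-After` headers:

```python
import time

class TokenBucket:
    """Client-side rate limiter for third-party REST APIs (token bucket).

    Tokens refill continuously at `rate_per_sec` up to a `burst` capacity;
    each request consumes tokens, and callers block until enough accrue.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)
```

Calling `bucket.acquire()` before every outbound request keeps the extractor under the provider's quota instead of reacting to HTTP 429 responses after the fact.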
Module 3: Schema Evolution and Data Modeling
- Defining backward- and forward-compatible schema changes in a schema registry for Avro-based pipelines
- Mapping nested JSON structures from application logs into flattened Parquet schemas for analytics
- Implementing slowly changing dimensions (SCD Type 2) in data lakehouse environments using Delta Lake
- Resolving naming conflicts and semantic mismatches across departments during enterprise data modeling
- Enforcing data type consistency when merging datasets from sources with implicit typing (e.g., CSV)
- Designing partitioning and bucketing strategies in Hive-compatible metastores for query performance
- Managing schema drift detection using statistical profiling and automated alerting
- Versioning data models in source control and coordinating deployment across staging and production
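The backward-compatibility rules a schema registry enforces can be modeled simply: a change is backward compatible when a reader using the new schema can still decode data written with the old one. The dict-based schema representation below is a deliberate simplification of Avro's rules, for illustration only:

```python
def is_backward_compatible(old_fields, new_fields):
    """Simplified backward-compatibility check between two schemas.

    old_fields / new_fields: dicts of field name -> {"type": ..., "default": ...?}.
    A new reader can handle old data if every field it expects either
    existed before (with the same type) or carries a default to fall back on.
    """
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # brand-new required field: old records can't supply it
        if name in old_fields and old_fields[name]["type"] != spec["type"]:
            return False  # type changed in place: old values won't decode
    return True
```

Real registries (e.g. Confluent's) also handle type promotion, unions, and transitive compatibility across the full version history, but the core rule is the one shown: only add fields with defaults, never change a field's type in place.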
Module 4: Data Quality and Validation Frameworks
- Embedding Great Expectations or Soda Core checks into ingestion DAGs for early failure detection
- Configuring nullability rules per field based on business-criticality and downstream dependencies
- Implementing referential integrity checks across distributed datasets without foreign key constraints
- Quantifying data completeness by comparing record counts against upstream source audit logs
- Setting thresholds for anomaly detection on data distributions using historical baselines
- Logging data quality violations to a centralized observability platform for root cause analysis
- Handling quarantine workflows for invalid records with manual review and reprocessing paths
- Defining data freshness and accuracy targets in service-level agreements (SLAs) with data consumers
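Threshold-based anomaly detection against a historical baseline, as listed above, can be as simple as a z-score test on a batch-level metric such as row count. A sketch (the three-sigma default is a common but arbitrary starting point):

```python
import statistics

def is_anomalous(value, history, z_threshold=3.0):
    """Flag `value` (e.g. today's row count) if it deviates from the
    historical baseline by more than z_threshold standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return value != mean  # perfectly flat history: any change is notable
    return abs(value - mean) / stdev > z_threshold
```

For seasonal pipelines (weekday vs. weekend volumes), the baseline window should be chosen per season, e.g. comparing Mondays only against previous Mondays.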
Module 5: Metadata Management and Lineage Tracking
- Instrumenting pipeline steps to emit operational metadata (e.g., row counts, timestamps) to a metadata store
- Integrating Apache Atlas or Marquez for end-to-end lineage across batch and streaming jobs
- Automating schema documentation extraction and publishing to a data catalog on pipeline deployment
- Mapping business glossary terms to technical column names using a stewardship workflow
- Tracking data ownership and PII classification tags through automated scanning and manual override
- Implementing impact analysis features to assess downstream effects of source schema changes
- Enabling search and discovery of datasets using full-text and faceted search in a catalog UI
- Archiving and versioning metadata snapshots to support audit and compliance requirements
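Instrumenting pipeline steps to emit operational metadata, the first bullet above, fits naturally into a decorator. Here the "metadata store" is a plain list standing in for a real backend such as a Marquez or Atlas API:

```python
import time
from functools import wraps

METADATA_STORE = []  # stand-in for a real metadata backend

def emit_metadata(step_name):
    """Decorator: record start/end timestamps and output row count
    for a pipeline step, without touching the step's own logic."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.time()
            rows = fn(*args, **kwargs)
            METADATA_STORE.append({
                "step": step_name,
                "started_at": started,
                "ended_at": time.time(),
                "row_count": len(rows),
            })
            return rows
        return wrapper
    return decorator
```

Because the instrumentation lives outside the step body, every decorated task emits the same record shape, which is what makes cross-pipeline lineage and freshness queries possible downstream.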
Module 6: Security, Privacy, and Access Governance
- Implementing row- and column-level security in query engines like Presto or Spark SQL
- Masking PII fields dynamically using deterministic encryption or hashing in reporting layers
- Integrating with enterprise IAM systems (e.g., Okta, Azure AD) for attribute-based access control
- Auditing access patterns to sensitive datasets using query log monitoring and alerting
- Applying data retention policies based on legal hold requirements and GDPR right-to-erasure
- Encrypting data at rest using customer-managed keys in cloud storage services
- Conducting data protection impact assessments (DPIAs) for new high-risk data integrations
- Enforcing secure handoffs between teams using signed data release manifests
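Deterministic masking of PII fields, mentioned above for reporting layers, is commonly done with a keyed hash: the same input always yields the same token (so joins and group-bys still work), but the raw value never leaves the secure zone. A minimal sketch using HMAC-SHA256; the 16-character truncation is an illustrative choice:

```python
import hashlib
import hmac

def mask_pii(value, secret_key):
    """Deterministically pseudonymize a PII value with HMAC-SHA256.

    Keying the hash with a secret (rather than hashing bare values)
    prevents dictionary attacks: without the key, an attacker cannot
    precompute tokens for known emails or names.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]
```

Note that deterministic tokens are pseudonymization, not anonymization: frequency analysis can still leak information, so GDPR-style erasure typically also requires destroying or rotating the key.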
Module 7: Orchestration and Pipeline Operations
- Designing Airflow DAGs with proper task dependencies, retries, and SLA monitoring
- Parameterizing pipelines to support multi-environment deployment (dev, test, prod) without code changes
- Implementing pipeline idempotency to allow safe reruns after partial failures
- Managing cross-DAG dependencies using external task sensors or message-based triggers
- Scaling executor types (Kubernetes, Celery) based on workload concurrency and resource demands
- Integrating pipeline logs with centralized logging systems (e.g., ELK, Datadog) for troubleshooting
- Scheduling backfills with controlled concurrency to avoid resource contention
- Automating pipeline testing using synthetic datasets and mock sources in CI/CD pipelines
Module 8: Performance Optimization and Cost Control
- Tuning Spark executors (memory, cores, instances) based on data volume and shuffle behavior
- Optimizing file sizes in data lakes to balance query performance and storage costs
- Implementing predicate pushdown and column pruning in source connectors for efficient filtering
- Choosing between format options (Parquet, ORC, Iceberg) based on update patterns and query workloads
- Monitoring cloud data transfer costs and minimizing cross-region egress in hybrid architectures
- Using materialized views or pre-aggregated tables to accelerate frequent analytical queries
- Right-sizing cluster nodes in managed services (e.g., EMR, Databricks) to match workload profiles
- Implementing auto-scaling policies for streaming consumers based on lag metrics
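The lag-based auto-scaling policy in the last bullet reduces to a small sizing function: target enough consumers that each one's share of the backlog stays under a budget, clamped to the partition count or a cost ceiling. Names and defaults below are illustrative:

```python
import math

def desired_consumers(total_lag, target_lag_per_consumer,
                      min_consumers=1, max_consumers=32):
    """Size a streaming consumer group from its backlog.

    total_lag: current total consumer lag (messages behind) across partitions.
    target_lag_per_consumer: the backlog one consumer should own at most.
    The result is clamped, since scaling past the topic's partition count
    (or the cost ceiling) buys nothing.
    """
    needed = math.ceil(total_lag / target_lag_per_consumer) if total_lag else min_consumers
    return max(min_consumers, min(max_consumers, needed))
```

In practice this function would be driven by a lag metric scraped from the broker (e.g. Kafka consumer-group lag) and smoothed over a window, so short bursts do not trigger scale-up/scale-down thrash.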
Module 9: Monitoring, Alerting, and Incident Response
- Defining critical metrics (e.g., lag, throughput, error rate) for real-time pipeline dashboards
- Setting dynamic alert thresholds using statistical baselines instead of static values
- Correlating pipeline failures with infrastructure metrics (CPU, memory, network) for root cause analysis
- Implementing health checks for dependent services to detect cascading failures early
- Creating runbooks with step-by-step recovery procedures for common failure scenarios
- Automating rollback of pipeline deployments using versioned artifacts and infrastructure as code
- Conducting blameless postmortems and tracking remediation tasks in a centralized system
- Simulating failure scenarios (e.g., source outage, schema change) in staging for resilience testing
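Dynamic alert thresholds from statistical baselines, as opposed to static values, can be sketched with an exponentially weighted baseline over the metric stream. This is one simple scheme among many (the EWMA weights, deviation factor, and warm-up count are all illustrative tuning knobs):

```python
class DynamicThreshold:
    """Dynamic alert threshold: alert when a metric deviates from its
    EWMA baseline by more than `factor` times the smoothed deviation."""

    def __init__(self, alpha=0.3, factor=3.0, warmup=5):
        self.alpha = alpha      # smoothing weight given to each new sample
        self.factor = factor    # how many deviations count as anomalous
        self.warmup = warmup    # samples to observe before alerting at all
        self.baseline = None
        self.deviation = 0.0
        self.seen = 0

    def observe(self, value):
        """Feed one sample; return True if it should fire an alert."""
        self.seen += 1
        if self.baseline is None:
            self.baseline = float(value)
            return False
        diff = abs(value - self.baseline)
        alert = self.seen > self.warmup and diff > self.factor * max(self.deviation, 1e-9)
        # Update the baseline *after* the alert decision, so a spike does
        # not immediately absorb itself into its own threshold.
        self.deviation = (1 - self.alpha) * self.deviation + self.alpha * diff
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * value
        return alert
```

Unlike a static threshold, this adapts as normal traffic levels drift, which is what keeps dashboards quiet during organic growth yet sensitive to genuine lag or error-rate spikes.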