This curriculum covers the technical and operational breadth of a multi-workshop program on building and operating enterprise-grade data integration systems, comparable to the internal capability programs that large organisations use to scale their data platforms.
Module 1: Architecting Scalable Data Ingestion Pipelines
- Selecting between batch and streaming ingestion based on SLA requirements and source system capabilities
- Designing idempotent ingestion workflows to handle duplicate messages in distributed systems
- Implementing backpressure mechanisms in Kafka consumers to prevent downstream system overload
- Configuring retry strategies with exponential backoff for transient failures in cloud-based APIs
- Partitioning strategies for large-scale file ingestion from object storage to optimize parallel processing
- Validating data schema upon ingestion using schema registry integration with Avro or Protobuf
- Monitoring end-to-end latency across ingestion stages using distributed tracing tools like OpenTelemetry
- Securing data in transit using mTLS and enforcing authentication between ingestion components
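The retry strategy with exponential backoff described above can be sketched in plain Python. This is an illustrative helper, not a specific library's API; the function name, defaults, and the choice of full jitter are assumptions:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retryable=(ConnectionError, TimeoutError)):
    """Call fn, retrying transient failures with capped exponential backoff.

    Full jitter (a random sleep up to the capped delay) spreads retries out
    so many clients recovering at once do not hammer the API in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the transient error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

In practice the `retryable` tuple should list only errors known to be transient (timeouts, throttling responses); retrying on permanent errors such as authentication failures just delays the inevitable.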
Module 2: Data Source Connectivity and Protocol Management
- Choosing appropriate JDBC fetch sizes and connection pooling parameters for high-volume RDBMS extraction
- Implementing OAuth 2.0 and API key rotation for secure access to SaaS platforms like Salesforce or Workday
- Handling change data capture (CDC) from PostgreSQL logical replication slots without causing WAL bloat
- Configuring gRPC endpoints for efficient communication with microservices exposing real-time data
- Managing API rate limits and quotas when extracting from third-party RESTful services
- Integrating mainframe data via IBM MQ (formerly MQSeries) with proper message segmentation and recovery logic
- Using SSH tunneling for secure access to on-premises databases in hybrid cloud environments
- Normalizing data from heterogeneous sources with inconsistent timestamp formats and encodings
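Client-side handling of API rate limits and quotas, as covered above, is often implemented as a token bucket. The sketch below is a minimal single-threaded version; a production limiter would also need thread safety and awareness of server-sent `Retry-After` headers:

```python
import time

class TokenBucket:
    """Client-side rate limiter for third-party REST APIs (token bucket).

    Tokens refill continuously at `rate_per_sec` up to a `burst` capacity;
    each request consumes tokens, and callers block until enough accrue.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)
```

Calling `bucket.acquire()` before every outbound request keeps the extractor under the provider's quota instead of reacting to HTTP 429 responses after the fact.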
Module 3: Schema Evolution and Data Modeling
- Defining backward- and forward-compatible schema changes in a schema registry for Avro-based pipelines
- Mapping nested JSON structures from application logs into flattened Parquet schemas for analytics
- Implementing slowly changing dimensions (SCD Type 2) in data lakehouse environments using Delta Lake
- Resolving naming conflicts and semantic mismatches across departments during enterprise data modeling
- Enforcing data type consistency when merging datasets from sources with implicit typing (e.g., CSV)
- Designing partitioning and bucketing strategies in Hive-compatible metastores for query performance
- Managing schema drift detection using statistical profiling and automated alerting
- Versioning data models in source control and coordinating deployment across staging and production
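The backward-compatibility rules a schema registry enforces can be modeled simply: a change is backward compatible when a reader using the new schema can still decode data written with the old one. The dict-based schema representation below is a deliberate simplification of Avro's rules, for illustration only:

```python
def is_backward_compatible(old_fields, new_fields):
    """Simplified backward-compatibility check between two schemas.

    old_fields / new_fields: dicts of field name -> {"type": ..., "default": ...?}.
    A new reader can handle old data if every field it expects either
    existed before (with the same type) or carries a default to fall back on.
    """
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # brand-new required field: old records can't supply it
        if name in old_fields and old_fields[name]["type"] != spec["type"]:
            return False  # type changed in place: old values won't decode
    return True
```

Real registries (e.g. Confluent's) also handle type promotion, unions, and transitive compatibility across the full version history, but the core rule is the one shown: only add fields with defaults, never change a field's type in place.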
Module 4: Data Quality and Validation Frameworks
- Embedding Great Expectations or Soda Core checks into ingestion DAGs for early failure detection
- Configuring nullability rules per field based on business-criticality and downstream dependencies
- Implementing referential integrity checks across distributed datasets without foreign key constraints
- Quantifying data completeness by comparing record counts against upstream source audit logs
- Setting thresholds for anomaly detection on data distributions using historical baselines
- Logging data quality violations to a centralized observability platform for root cause analysis
- Handling quarantine workflows for invalid records with manual review and reprocessing paths
- Defining data freshness and accuracy targets in service-level agreements (SLAs) with data consumers
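Threshold-based anomaly detection against a historical baseline, as listed above, can be as simple as a z-score test on a batch-level metric such as row count. A sketch (the three-sigma default is a common but arbitrary starting point):

```python
import statistics

def is_anomalous(value, history, z_threshold=3.0):
    """Flag `value` (e.g. today's row count) if it deviates from the
    historical baseline by more than z_threshold standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return value != mean  # perfectly flat history: any change is notable
    return abs(value - mean) / stdev > z_threshold
```

For seasonal pipelines (weekday vs. weekend volumes), the baseline window should be chosen per season, e.g. comparing Mondays only against previous Mondays.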
Module 5: Metadata Management and Lineage Tracking
- Instrumenting pipeline steps to emit operational metadata (e.g., row counts, timestamps) to a metadata store
- Integrating Apache Atlas or Marquez for end-to-end lineage across batch and streaming jobs
- Automating schema documentation extraction and publishing to a data catalog on pipeline deployment
- Mapping business glossary terms to technical column names using a stewardship workflow
- Tracking data ownership and PII classification tags through automated scanning and manual override
- Implementing impact analysis features to assess downstream effects of source schema changes
- Enabling search and discovery of datasets using full-text and faceted search in a catalog UI
- Archiving and versioning metadata snapshots to support audit and compliance requirements
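Instrumenting pipeline steps to emit operational metadata, the first bullet above, fits naturally into a decorator. Here the "metadata store" is a plain list standing in for a real backend such as a Marquez or Atlas API:

```python
import time
from functools import wraps

METADATA_STORE = []  # stand-in for a real metadata backend

def emit_metadata(step_name):
    """Decorator: record start/end timestamps and output row count
    for a pipeline step, without touching the step's own logic."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.time()
            rows = fn(*args, **kwargs)
            METADATA_STORE.append({
                "step": step_name,
                "started_at": started,
                "ended_at": time.time(),
                "row_count": len(rows),
            })
            return rows
        return wrapper
    return decorator
```

Because the instrumentation lives outside the step body, every decorated task emits the same record shape, which is what makes cross-pipeline lineage and freshness queries possible downstream.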
Module 6: Security, Privacy, and Access Governance
- Implementing row- and column-level security in query engines like Presto or Spark SQL
- Masking PII fields dynamically using deterministic encryption or hashing in reporting layers
- Integrating with enterprise IAM systems (e.g., Okta, Azure AD) for attribute-based access control
- Auditing access patterns to sensitive datasets using query log monitoring and alerting
- Applying data retention policies based on legal hold requirements and GDPR right-to-erasure
- Encrypting data at rest using customer-managed keys in cloud storage services
- Conducting data protection impact assessments (DPIAs) for new high-risk data integrations
- Enforcing secure handoffs between teams using signed data release manifests
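Deterministic masking of PII fields, mentioned above for reporting layers, is commonly done with a keyed hash: the same input always yields the same token (so joins and group-bys still work), but the raw value never leaves the secure zone. A minimal sketch using HMAC-SHA256; the 16-character truncation is an illustrative choice:

```python
import hashlib
import hmac

def mask_pii(value, secret_key):
    """Deterministically pseudonymize a PII value with HMAC-SHA256.

    Keying the hash with a secret (rather than hashing bare values)
    prevents dictionary attacks: without the key, an attacker cannot
    precompute tokens for known emails or names.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]
```

Note that deterministic tokens are pseudonymization, not anonymization: frequency analysis can still leak information, so GDPR-style erasure typically also requires destroying or rotating the key.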
Module 7: Orchestration and Pipeline Operations
- Designing Airflow DAGs with proper task dependencies, retries, and SLA monitoring
- Parameterizing pipelines to support multi-environment deployment (dev, test, prod) without code changes
- Implementing pipeline idempotency to allow safe reruns after partial failures
- Managing cross-DAG dependencies using external task sensors or message-based triggers
- Scaling executor types (Kubernetes, Celery) based on workload concurrency and resource demands
- Integrating pipeline logs with centralized logging systems (e.g., ELK, Datadog) for troubleshooting
- Scheduling backfills with controlled concurrency to avoid resource contention
- Automating pipeline testing using synthetic datasets and mock sources in CI/CD pipelines
Module 8: Performance Optimization and Cost Control
- Tuning Spark executors (memory, cores, instances) based on data volume and shuffle behavior
- Optimizing file sizes in data lakes to balance query performance and storage costs
- Implementing predicate pushdown and column pruning in source connectors for efficient filtering
- Choosing between format options (Parquet, ORC, Iceberg) based on update patterns and query workloads
- Monitoring cloud data transfer costs and minimizing cross-region egress in hybrid architectures
- Using materialized views or pre-aggregated tables to accelerate frequent analytical queries
- Right-sizing cluster nodes in managed services (e.g., EMR, Databricks) to match workload profiles
- Implementing auto-scaling policies for streaming consumers based on lag metrics
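The lag-based auto-scaling policy in the last bullet reduces to a small sizing function: target enough consumers that each one's share of the backlog stays under a budget, clamped to the partition count or a cost ceiling. Names and defaults below are illustrative:

```python
import math

def desired_consumers(total_lag, target_lag_per_consumer,
                      min_consumers=1, max_consumers=32):
    """Size a streaming consumer group from its backlog.

    total_lag: current total consumer lag (messages behind) across partitions.
    target_lag_per_consumer: the backlog one consumer should own at most.
    The result is clamped, since scaling past the topic's partition count
    (or the cost ceiling) buys nothing.
    """
    needed = math.ceil(total_lag / target_lag_per_consumer) if total_lag else min_consumers
    return max(min_consumers, min(max_consumers, needed))
```

In practice this function would be driven by a lag metric scraped from the broker (e.g. Kafka consumer-group lag) and smoothed over a window, so short bursts do not trigger scale-up/scale-down thrash.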
Module 9: Monitoring, Alerting, and Incident Response
- Defining critical metrics (e.g., lag, throughput, error rate) for real-time pipeline dashboards
- Setting dynamic alert thresholds using statistical baselines instead of static values
- Correlating pipeline failures with infrastructure metrics (CPU, memory, network) for root cause analysis
- Implementing health checks for dependent services to detect cascading failures early
- Creating runbooks with step-by-step recovery procedures for common failure scenarios
- Automating rollback of pipeline deployments using versioned artifacts and infrastructure as code
- Conducting blameless postmortems and tracking remediation tasks in a centralized system
- Simulating failure scenarios (e.g., source outage, schema change) in staging for resilience testing
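Dynamic alert thresholds from statistical baselines, as opposed to static values, can be sketched with an exponentially weighted baseline over the metric stream. This is one simple scheme among many (the EWMA weights, deviation factor, and warm-up count are all illustrative tuning knobs):

```python
class DynamicThreshold:
    """Dynamic alert threshold: alert when a metric deviates from its
    EWMA baseline by more than `factor` times the smoothed deviation."""

    def __init__(self, alpha=0.3, factor=3.0, warmup=5):
        self.alpha = alpha      # smoothing weight given to each new sample
        self.factor = factor    # how many deviations count as anomalous
        self.warmup = warmup    # samples to observe before alerting at all
        self.baseline = None
        self.deviation = 0.0
        self.seen = 0

    def observe(self, value):
        """Feed one sample; return True if it should fire an alert."""
        self.seen += 1
        if self.baseline is None:
            self.baseline = float(value)
            return False
        diff = abs(value - self.baseline)
        alert = self.seen > self.warmup and diff > self.factor * max(self.deviation, 1e-9)
        # Update the baseline *after* the alert decision, so a spike does
        # not immediately absorb itself into its own threshold.
        self.deviation = (1 - self.alpha) * self.deviation + self.alpha * diff
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * value
        return alert
```

Unlike a static threshold, this adapts as normal traffic levels drift, which is what keeps dashboards quiet during organic growth yet sensitive to genuine lag or error-rate spikes.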