This curriculum spans the technical and operational breadth of a multi-quarter platform engineering initiative, covering the design, deployment, and governance of cloud-native systems at the scale of a large organisation’s internal developer platform.
Module 1: Architectural Foundations of Cloud-Native Systems
- Selecting between monolithic decomposition and greenfield microservices based on team velocity, legacy integration needs, and deployment pipeline maturity.
- Defining bounded contexts in domain-driven design to align service boundaries with business capabilities and reduce inter-service coupling.
- Implementing service discovery patterns using DNS, sidecar proxies, or platform-native mechanisms like Kubernetes Services.
- Choosing between synchronous (REST/gRPC) and asynchronous (message queues, event streams) communication based on latency, reliability, and scalability requirements.
- Evaluating the operational impact of polyglot persistence across services, including backup strategies, observability, and data ownership.
- Designing for failure by incorporating circuit breakers, timeouts, and bulkheads into inter-service communication layers.
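As an illustration of the circuit breaker named in the last item, here is a minimal sketch. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for the example, not part of any particular library. It counts consecutive failures, rejects calls while open, and lets one trial call through after a cool-down.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: half-open, allow this one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In practice the wrapped `fn` would also carry its own timeout, so that a slow dependency counts as a failure instead of blocking the caller.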
Module 2: Containerization and Immutable Infrastructure
- Constructing minimal, secure container images using distroless bases or scratch, avoiding inclusion of package managers and debugging tools in production.
- Implementing multi-stage builds to separate build-time dependencies from runtime artifacts and reduce image attack surface.
- Signing and verifying container images using cosign or Notary to enforce supply chain security in CI/CD pipelines.
- Managing container lifecycle hooks for graceful shutdown and pre-start initialization in stateful or connection-heavy services.
- Configuring resource requests and limits in Kubernetes to prevent resource starvation and ensure fair scheduling.
- Enforcing the Pod Security Standards through Pod Security Admission (the built-in replacement for PodSecurityPolicy, which was removed in Kubernetes 1.25) to restrict privileged containers and host namespace access.
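The resource and privilege constraints in the last two items can be checked before a manifest ever reaches a cluster. The sketch below is a hypothetical pre-deployment lint, not the admission controller itself: the function name and the policy it encodes (requests and limits required for both CPU and memory) are assumptions, while the field names follow the Kubernetes Pod spec.

```python
def check_pod_spec(pod_spec):
    """Return a list of human-readable violations for a Pod spec dict."""
    problems = []

    # Host namespace sharing is rejected outright in this example policy.
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(flag):
            problems.append(f"{flag} must not be enabled")

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        if container.get("securityContext", {}).get("privileged"):
            problems.append(f"{name}: privileged containers are not allowed")

        # Assumed policy: every container declares cpu and memory
        # under both requests and limits.
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    problems.append(f"{name}: missing resources.{kind}.{resource}")

    return problems
```

A check like this complements cluster-side enforcement by giving developers feedback in CI; it does not replace it.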
Module 3: Continuous Delivery and GitOps Practices
- Structuring Git repository layouts (mono-repo vs multi-repo) based on team autonomy, dependency management, and CI scalability.
- Implementing canary deployments using service meshes or ingress controllers with automated traffic shifting based on health and latency metrics.
- Integrating automated rollback mechanisms triggered by SLO violations or sudden error rate increases in monitoring systems.
- Managing environment promotion through declarative manifests, avoiding configuration drift between staging and production.
- Securing CI/CD pipelines by minimizing credential exposure using workload identity or short-lived tokens.
- Using Argo CD or Flux to enforce GitOps reconciliation and detect configuration drift in production clusters.
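The canary and rollback items above come down to a repeated decision: given the canary's current metrics, shift more traffic or pull it back. A minimal sketch of that decision follows. The thresholds, the step size, and the names `CanaryPolicy` and `next_weight` are illustrative assumptions; real progressive delivery tools expose their own configuration for this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryPolicy:
    """Illustrative thresholds; real values would come from the service's SLOs."""
    max_error_rate: float = 0.01   # fraction of failed requests tolerated
    max_p99_ms: float = 500.0      # p99 latency ceiling in milliseconds
    step: int = 20                 # percentage points added per healthy interval


def next_weight(current, error_rate, p99_ms, policy=CanaryPolicy()):
    """Return the canary's next traffic percentage; 0 means roll back."""
    if error_rate > policy.max_error_rate or p99_ms > policy.max_p99_ms:
        return 0
    return min(100, current + policy.step)
```

For example, a canary at 20% with a 0.1% error rate and a 200 ms p99 moves to 40%, while any interval that breaches either threshold drops it to 0%.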
Module 4: Service Mesh and Inter-Service Communication
- Deciding between ambient and sidecar-based service mesh architectures based on performance overhead and operational complexity.
- Configuring mutual TLS between services using Istio or Linkerd to enforce zero-trust network policies.
- Implementing fine-grained traffic routing rules for A/B testing, header-based routing, and version pinning in staging environments.
- Enabling distributed tracing across service boundaries using W3C Trace Context and exporting spans to backend systems like Jaeger or Tempo.
- Managing certificate lifecycle for service mesh identity, including rotation and cross-cluster trust setup.
- Monitoring east-west traffic for anomalies using service-level metrics such as request volume, error rates, and latency percentiles.
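To make the W3C Trace Context item concrete, the sketch below parses a `traceparent` header and derives the header a service would send downstream. It handles only the version `00` layout (`version-traceid-parentid-flags`), and the function names are assumptions for the example; starting a new trace as sampled (`01`) when the incoming header is invalid is also an assumed policy. Production services would normally rely on an OpenTelemetry SDK instead.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    _, trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are explicitly invalid in the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header):
    """Build the outgoing header: same trace, fresh span id."""
    ctx = parse_traceparent(header)
    if ctx is None:
        # Assumed policy: start a new, sampled trace on invalid input.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{flags}"
```

The trace id stays constant across every hop, which is what lets a backend such as Jaeger or Tempo stitch the spans into one trace.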
Module 5: Observability and Runtime Intelligence
- Instrumenting applications with structured logging to enable parsing, filtering, and correlation in centralized systems like Loki or Elasticsearch.
- Defining service-level objectives (SLOs) and error budgets to guide incident response and feature development prioritization.
- Configuring adaptive sampling for traces to balance observability fidelity with storage cost in high-throughput systems.
- Correlating logs, metrics, and traces using a shared context ID propagated across service calls and message queues.
- Deploying synthetic monitoring probes to detect degradation in external dependencies and third-party APIs.
- Setting up anomaly detection on metrics using statistical baselines instead of static thresholds to reduce false alerts.
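Two of the items above reduce to short calculations: how much of an error budget remains for a given SLO, and whether a new metric sample deviates from a statistical baseline. The following sketch shows one simple form of each. The function names and the z-score threshold of 3 are assumptions; real systems often use more robust baselines that account for seasonality.

```python
from statistics import mean, stdev


def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left; negative means it is overspent.

    slo is the target success ratio, e.g. 0.999 for "three nines".
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def is_anomalous(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` standard deviations
    from the mean of recent history (needs at least two samples)."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) / spread > threshold
```

With a 99.9% SLO over one million requests, 1,000 failures are allowed, so 250 failures leave roughly 75% of the budget.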
Module 6: Data Management in Distributed Systems
- Implementing the outbox pattern to ensure reliable event publication from transactional databases without distributed transactions.
- Choosing between event sourcing and CRUD-based persistence based on audit requirements, temporal queries, and system complexity.
- Managing schema evolution in message contracts using versioned schemas and compatibility checks in schema registries.
- Designing eventual consistency models with compensating actions for distributed workflows that cannot support two-phase commits.
- Partitioning and sharding stateful services based on access patterns, geographic locality, and regulatory constraints.
- Integrating change data capture (CDC) tools like Debezium to stream database changes to downstream consumers reliably.
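The outbox pattern from the first item can be shown end to end with an in-memory SQLite database. This is a sketch under stated assumptions: the table layout, the `order.placed` topic, and the `publish` callback are hypothetical, and a production relay would typically be a separate process or a CDC connector such as Debezium reading the outbox table.

```python
import json
import sqlite3


def open_db():
    """Create an in-memory database with a business table and an outbox."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def place_order(db, order_id, customer):
    """Write the order and its event in one local transaction."""
    payload = json.dumps({"order_id": order_id, "customer": customer})
    with db:  # commits both inserts together, or rolls both back on error
        db.execute("INSERT INTO orders (id, customer) VALUES (?, ?)",
                   (order_id, customer))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("order.placed", payload))


def relay_once(db, publish):
    """Publish pending events in order and mark each as published."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the relay publishes before it marks a row, a crash between the two steps causes a re-send on the next pass. Delivery is therefore at-least-once, and consumers need to be idempotent.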
Module 7: Security and Compliance in Cloud-Native Environments
- Enforcing least-privilege IAM roles for workloads by binding them to Kubernetes service accounts through projected, short-lived service account tokens.
- Scanning container images for CVEs and license compliance during CI, blocking high-severity findings from promotion.
- Implementing secrets management using external providers (HashiCorp Vault, AWS Secrets Manager) rather than storing secrets in ConfigMaps or plain-text environment variables.
- Conducting runtime threat detection using eBPF-based tools like Falco to identify anomalous process or network activity.
- Designing audit trails for configuration changes in Kubernetes using admission webhooks and log aggregation.
- Aligning deployment topology with regulatory domains (e.g., GDPR, HIPAA) through cluster isolation, data residency controls, and encryption key management.
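The image-scanning item implies a promotion gate in CI: read the scanner's findings and fail the pipeline when any finding meets a severity threshold. A minimal sketch follows. The finding format (`id` and `severity` keys), the function name, and the allowlist mechanism are assumptions; real scanners emit their own report schemas that would need mapping into this shape.

```python
SEVERITY_ORDER = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def gate(findings, block_at="HIGH", allowlist=frozenset()):
    """Return (ok, blocking): ok is False if any finding blocks promotion.

    findings is a list of dicts with hypothetical keys "id" and "severity".
    Unknown severities are treated as CRITICAL so the gate fails closed.
    """
    threshold = SEVERITY_ORDER[block_at]
    blocking = [
        finding for finding in findings
        if SEVERITY_ORDER.get(finding["severity"].upper(), 4) >= threshold
        and finding["id"] not in allowlist
    ]
    return (not blocking, blocking)
```

The allowlist stands in for reviewed, time-boxed exceptions; keeping it in version control gives the audit trail that the compliance items in this module call for.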
Module 8: Platform Engineering and Internal Developer Platforms
- Defining standardized deployment templates using Kubernetes Operators or Custom Resource Definitions to reduce configuration drift.
- Building self-service portals for environment provisioning with guardrails on resource quotas and approved base images.
- Integrating golden path workflows into CI/CD templates to guide developers toward secure, observable, and scalable defaults.
- Measuring the platform's effect on delivery performance through DORA metrics collected across teams to identify bottlenecks in deployment frequency and lead time.
- Managing cross-cutting concerns (logging, tracing, auth) through platform-managed sidecars or service mesh integration.
- Establishing feedback loops with development teams to prioritize platform improvements based on incident root cause analysis.
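Two of the DORA metrics mentioned above, deployment frequency and lead time for changes, can be derived from a list of commit and deploy timestamps. The sketch below assumes a hypothetical record shape of `(commit_time, deploy_time)` pairs and reports the median lead time; how those timestamps are collected depends on the CI/CD tooling in use.

```python
from datetime import datetime, timedelta
from statistics import median


def dora_summary(deployments: list[tuple[datetime, datetime]], window_days: int):
    """Summarise deployment frequency and median lead time.

    deployments holds (commit_time, deploy_time) pairs observed during
    a reporting window that is window_days long.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    lead_seconds = [
        (deployed - committed).total_seconds()
        for committed, deployed in deployments
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": (
            timedelta(seconds=median(lead_seconds)) if lead_seconds else None
        ),
    }
```

The median is used instead of the mean so that a single long-running change does not dominate the lead time figure.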