Description

This curriculum spans the technical and operational breadth of a multi-quarter platform engineering initiative, covering the design, deployment, and governance of cloud-native systems at the scale of a large organisation’s internal developer platform.

Module 1: Architectural Foundations of Cloud-Native Systems

Choosing between decomposing an existing monolith and building greenfield microservices based on team velocity, legacy integration needs, and deployment pipeline maturity.
Defining bounded contexts in domain-driven design to align service boundaries with business capabilities and reduce inter-service coupling.
Implementing service discovery patterns using DNS, sidecar proxies, or platform-native mechanisms like Kubernetes Services.
Choosing between synchronous (REST/gRPC) and asynchronous (message queues, event streams) communication based on latency, reliability, and scalability requirements.
Evaluating the operational impact of polyglot persistence across services, including backup strategies, observability, and data ownership.
Designing for failure by incorporating circuit breakers, timeouts, and bulkheads into inter-service communication layers.
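
To make the circuit-breaker item above concrete, here is a minimal sketch in Python. The class name, the thresholds, and the injectable `clock` parameter are illustrative assumptions, not part of any particular library. It assumes a single-threaded caller and omits the locking a production breaker would need.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each outbound dependency in its own breaker instance also gives a simple form of bulkheading, since one failing dependency cannot trip the breaker of another.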

Module 2: Containerization and Immutable Infrastructure

Building minimal container images from distroless or scratch bases so that production images contain no package managers or debugging tools.
Implementing multi-stage builds to separate build-time dependencies from runtime artifacts and reduce image attack surface.
Signing and verifying container images using cosign or Notary to enforce supply chain security in CI/CD pipelines.
Managing container lifecycle hooks for graceful shutdown and pre-start initialization in stateful or connection-heavy services.
Configuring resource requests and limits in Kubernetes to prevent resource starvation and ensure fair scheduling.
Using Pod Security Admission (the replacement for PodSecurityPolicy, which was removed in Kubernetes 1.25) to restrict privileged containers and host namespace access.
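
The kind of check that Pod Security Admission performs can be sketched as a small Python function over a Pod manifest loaded as a dictionary. This is only an illustration of two of the restrictions named above (privileged containers and host namespaces); the function name is hypothetical, and it is not a substitute for the admission controller itself.

```python
def find_pod_security_violations(pod):
    """Return a list of human-readable violations for a Pod manifest dict."""
    violations = []
    spec = pod.get("spec", {})

    # Host namespace sharing is set at the pod level.
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(field) is True:
            violations.append(f"spec.{field} must not be true")

    # Privileged mode is set per container, including init containers.
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    for container in containers:
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged") is True:
            name = container.get("name", "?")
            violations.append(f"container {name!r} must not be privileged")

    return violations
```

A function like this could run in CI against rendered manifests, giving developers feedback before the cluster rejects the workload.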

Module 3: Continuous Delivery and GitOps Practices

Structuring Git repository layouts (mono-repo vs multi-repo) based on team autonomy, dependency management, and CI scalability.
Implementing canary deployments using service meshes or ingress controllers with automated traffic shifting based on health and latency metrics.
Integrating automated rollback mechanisms triggered by SLO violations or sudden error rate increases in monitoring systems.
Managing environment promotion through declarative manifests, avoiding configuration drift between staging and production.
Securing CI/CD pipelines by minimizing credential exposure using workload identity or short-lived tokens.
Using ArgoCD or Flux to enforce GitOps reconciliation and detect configuration drift in production clusters.
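
Drift detection, as mentioned above, amounts to comparing the state declared in Git with the state observed in the cluster. The sketch below shows the idea on plain dictionaries. It is a conceptual illustration with a hypothetical function name, and real reconcilers such as ArgoCD or Flux handle many more cases (defaulting, list merging, ignored fields).

```python
def find_drift(desired, live, path=""):
    """Return sorted paths where the live state differs from the desired state.

    Only keys present in `desired` are compared, so fields that the cluster
    adds on its own (such as status) are not reported as drift.
    """
    drift = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(here)
        elif isinstance(want, dict) and isinstance(live[key], dict):
            drift.extend(find_drift(want, live[key], here))
        elif want != live[key]:
            drift.append(here)
    return sorted(drift)


# Made-up example: someone scaled the deployment by hand in the cluster.
desired = {"spec": {"replicas": 3, "template": {"image": "app:1.4.2"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "app:1.4.2"}},
    "status": {"readyReplicas": 5},
}
drifted_paths = find_drift(desired, live)  # ["spec.replicas"]
```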

Module 4: Service Mesh and Inter-Service Communication

Deciding between ambient and sidecar-based service mesh architectures based on performance overhead and operational complexity.
Configuring mutual TLS between services using Istio or Linkerd to enforce zero-trust network policies.
Implementing fine-grained traffic routing rules for A/B testing, header-based routing, and version pinning in staging environments.
Enabling distributed tracing across service boundaries using W3C Trace Context and exporting spans to backend systems like Jaeger or Tempo.
Managing certificate lifecycle for service mesh identity, including rotation and cross-cluster trust setup.
Monitoring east-west traffic for anomalies using service-level metrics such as request volume, error rates, and latency percentiles.
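
Trace propagation with W3C Trace Context relies on the `traceparent` header, which for version 00 has the form `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The sketch below parses an incoming header and builds the header for an outgoing call. The function names are illustrative, only version 00 is handled, and starting a new sampled trace when the incoming header is invalid is an assumption of this example.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def child_traceparent(header):
    """Build the header for an outgoing call, keeping the incoming trace ID."""
    parsed = parse_traceparent(header)
    if parsed is None:
        # Assumption: with no valid context, start a new trace marked as sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    # The outgoing span gets a fresh 8-byte ID as the new parent-id.
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

In practice an OpenTelemetry SDK or the mesh proxy usually handles this propagation; the sketch only shows what travels on the wire.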

Module 5: Observability and Runtime Intelligence

Instrumenting applications with structured logging to enable parsing, filtering, and correlation in centralized systems like Loki or Elasticsearch.
Defining service-level objectives (SLOs) and error budgets to guide incident response and feature development prioritization.
Configuring adaptive sampling for traces to balance observability fidelity with storage cost in high-throughput systems.
Correlating logs, metrics, and traces using a shared context ID propagated across service calls and message queues.
Deploying synthetic monitoring probes to detect degradation in external dependencies and third-party APIs.
Setting up anomaly detection on metrics using statistical baselines instead of static thresholds to reduce false alerts.
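
The arithmetic behind an error budget is short enough to show directly. With a 99.9% availability target over 1,000,000 requests, the budget is (1 - 0.999) x 1,000,000 = 1,000 failed requests for the window. The helper below is a sketch with a hypothetical name and a request-based SLO; time-based SLOs follow the same pattern with minutes instead of requests.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise error budget consumption for one SLO window."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        # A 100% target (or an empty window) leaves no budget to spend.
        consumed = float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,          # 1.0 means fully spent
        "budget_exhausted": consumed >= 1.0,
    }
```

A team might, for example, pause feature rollouts once `budget_exhausted` is true and redirect effort to reliability work until the window resets.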

Module 6: Data Management in Distributed Systems

Implementing the outbox pattern to ensure reliable event publication from transactional databases without distributed transactions.
Choosing between event sourcing and CRUD-based persistence based on audit requirements, temporal queries, and system complexity.
Managing schema evolution in message contracts using versioned schemas and compatibility checks in schema registries.
Designing eventual consistency models with compensating actions for distributed workflows that cannot support two-phase commits.
Partitioning and sharding stateful services based on access patterns, geographic locality, and regulatory constraints.
Integrating change data capture (CDC) tools like Debezium to stream database changes to downstream consumers reliably.
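
The outbox pattern above can be sketched with Python's built-in `sqlite3` module. The table names, the event shape, and the `publish` callback are assumptions made for illustration. In a real system the relay would be a separate process, or a CDC tool such as Debezium reading the outbox table.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, order_id, item):
    """Write the business row and its event in one local transaction."""
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        event = json.dumps({"type": "OrderPlaced", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def relay_outbox(conn, publish):
    """Publish pending events in order; mark a row only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # may raise; the row then stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because a crash between `publish` and the `UPDATE` causes the event to be sent again on the next run, this design gives at-least-once delivery, so consumers need to be idempotent.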

Module 7: Security and Compliance in Cloud-Native Environments

Enforcing least-privilege cloud IAM roles for workloads by federating projected Kubernetes service account tokens with the cloud provider's identity service.
Scanning container images for CVEs and license compliance during CI, blocking high-severity findings from promotion.
Implementing secrets management using external providers (HashiCorp Vault, AWS Secrets Manager) rather than storing secrets in ConfigMaps or plain environment variables.
Conducting runtime threat detection using eBPF-based tools like Falco to identify anomalous process or network activity.
Designing audit trails for configuration changes in Kubernetes using admission webhooks and log aggregation.
Aligning deployment topology with regulatory domains (e.g., GDPR, HIPAA) through cluster isolation, data residency controls, and encryption key management.
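
The CI gate on image scan results described above reduces to a severity comparison. The sketch below assumes a simplified finding format (`id` and `severity` keys) rather than the output schema of any specific scanner, and the allowlist parameter is a hypothetical way to record accepted risks.

```python
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def blocking_findings(findings, threshold="HIGH", allowlist=()):
    """Return findings at or above `threshold` that are not allowlisted.

    Severities missing from SEVERITY_RANK rank as 0 and never block in this
    sketch; a stricter gate could choose to fail closed instead.
    """
    minimum = SEVERITY_RANK[threshold]
    return [
        finding
        for finding in findings
        if SEVERITY_RANK.get(finding["severity"].upper(), 0) >= minimum
        and finding["id"] not in allowlist
    ]


# Made-up findings with placeholder identifiers.
findings = [
    {"id": "CVE-0000-0001", "severity": "critical"},
    {"id": "CVE-0000-0002", "severity": "LOW"},
    {"id": "CVE-0000-0003", "severity": "HIGH"},
]
blocked = blocking_findings(findings)
```

A pipeline step would fail the build when `blocked` is non-empty, which keeps the image from being promoted.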

Module 8: Platform Engineering and Internal Developer Platforms

Defining standardized deployment templates using Kubernetes Operators or Custom Resource Definitions to reduce configuration drift.
Building self-service portals for environment provisioning with guardrails on resource quotas and approved base images.
Integrating golden path workflows into CI/CD templates to guide developers toward secure, observable, and scalable defaults.
Measuring the platform's effect on delivery performance through DORA metrics collected across teams to identify bottlenecks in deployment frequency and lead time.
Managing cross-cutting concerns (logging, tracing, auth) through platform-managed sidecars or service mesh integration.
Establishing feedback loops with development teams to prioritize platform improvements based on incident root cause analysis.
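
Two of the DORA metrics named above, deployment frequency and lead time for changes, can be computed from a list of commit and deploy timestamps. The function name and the input shape below are assumptions for illustration; real pipelines would pull these timestamps from the VCS and the deployment system.

```python
from datetime import datetime
from statistics import median


def delivery_metrics(deployments, window_days):
    """Compute deployment frequency and median lead time for changes.

    `deployments` is a list of (commit_time, deploy_time) datetime pairs
    that fall inside a reporting window of `window_days` days.
    """
    lead_times = [deployed - committed for committed, deployed in deployments]
    if lead_times:
        median_hours = median(lt.total_seconds() for lt in lead_times) / 3600
    else:
        median_hours = None  # no deployments in the window
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median_hours,
    }


# Made-up sample: three changes committed at 09:00 and deployed later that day.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    (start, datetime(2024, 1, 1, 11, 0)),
    (start, datetime(2024, 1, 1, 13, 0)),
    (start, datetime(2024, 1, 1, 21, 0)),
]
metrics = delivery_metrics(sample, window_days=7)
```

Using the median rather than the mean keeps a single slow change from dominating the lead-time figure.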