Description

This curriculum covers the technical and operational depth of a multi-workshop cloud transformation program. It addresses the architectural, security, and platform decisions that teams face during large-scale internal developer platform rollouts and cross-team service modernization efforts.

Module 1: Architectural Foundations of Cloud Native Systems

Choosing between decomposing an existing monolith and building greenfield microservices, based on team maturity and system criticality.
Defining service boundaries using domain-driven design (DDD) to align with business capabilities and reduce coupling.
Implementing API gateways to manage routing, authentication, and rate limiting across heterogeneous backend services.
Choosing synchronous (REST/gRPC) versus asynchronous (message queues) communication based on latency and reliability requirements.
Evaluating the operational overhead of service mesh adoption for observability, security, and traffic control.
Designing for disposability by ensuring services start quickly and terminate gracefully to support autoscaling and rolling updates.
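
The disposability item above can be sketched as a worker that finishes its current unit of work and then exits when the orchestrator sends SIGTERM. This is a minimal illustration, not a prescribed implementation; `shutdown_requested`, `handle_sigterm`, `install_handlers`, and `run_worker` are hypothetical names introduced here.

```python
import signal
import threading

# Flag checked between units of work; set when a termination signal arrives.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that shutdown was requested instead of exiting mid-task."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_one, max_iterations=None):
    """Run process_one() repeatedly until shutdown is requested.

    max_iterations is an assumption added so the loop can be bounded in
    tests. Returns the number of completed units of work.
    """
    done = 0
    while not shutdown_requested.is_set():
        if max_iterations is not None and done >= max_iterations:
            break
        process_one()
        done += 1
    return done
```

A service would call `install_handlers()` at startup and keep each unit of work short enough to finish within the pod's termination grace period.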

Module 2: Containerization and Orchestration at Scale

Standardizing container images using base image policies and vulnerability scanning in CI pipelines.
Configuring Kubernetes resource requests and limits to balance performance, cost, and cluster utilization.
Implementing pod disruption budgets to maintain availability during node maintenance or cluster upgrades.
Managing non-sensitive configuration with Kubernetes ConfigMaps and secrets with external secret managers such as HashiCorp Vault.
Designing multi-tenant namespaces with network policies and resource quotas to isolate workloads.
Automating cluster lifecycle management using infrastructure-as-code tools like Terraform or Pulumi.
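
The arithmetic behind the pod disruption budget item can be shown with a simplified sketch. It assumes an integer `minAvailable` only; percentages and `maxUnavailable` are left out, and the function names are hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be voluntarily evicted right now.

    Simplified model: healthy pods minus the required minimum, never
    below zero.
    """
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int, evictions: int) -> bool:
    """True if evicting `evictions` pods keeps the budget satisfied."""
    return evictions <= allowed_disruptions(healthy_pods, min_available)
```

With five healthy replicas and a minimum of four, a node drain can evict one pod at a time. If a replica is already unhealthy, the drain has to wait.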

Module 3: Continuous Delivery and GitOps Practices

Structuring Git repository layouts (mono-repo vs. multi-repo) to support independent service deployment and team autonomy.
Implementing canary deployments with traffic shifting using service mesh or ingress controllers.
Enforcing deployment approvals and rollback mechanisms for production environments via policy engines.
Integrating security scanning (SAST, DAST, SCA) into CI pipelines without introducing unacceptable delays.
Using GitOps tools like Argo CD to reconcile desired state and detect configuration drift in clusters.
Managing Helm chart versioning and dependency updates across multiple environments and teams.
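
The idea of reconciling desired state and detecting drift can be illustrated by comparing two nested dictionaries. This is a conceptual sketch only and does not reproduce how Argo CD computes diffs; `find_drift` and the sample keys are hypothetical.

```python
def find_drift(desired: dict, live: dict, path: str = "") -> list:
    """Return (path, description) tuples where live state differs from desired."""
    drift = []
    for key in sorted(set(desired) | set(live)):
        here = f"{path}.{key}" if path else str(key)
        if key not in live:
            drift.append((here, "missing in live"))
        elif key not in desired:
            drift.append((here, "unexpected in live"))
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse into nested sections such as a container spec.
            drift.extend(find_drift(desired[key], live[key], here))
        elif desired[key] != live[key]:
            drift.append((here, f"{desired[key]!r} != {live[key]!r}"))
    return drift


desired_state = {"replicas": 3, "image": {"tag": "1.2.0"}}
live_state = {"replicas": 5, "image": {"tag": "1.2.0"}, "debug": True}
report = find_drift(desired_state, live_state)
```

In this example `report` flags the manually scaled `replicas` field and the extra `debug` key, the kind of out-of-band change a GitOps controller would either revert or surface for review.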

Module 4: Observability and Runtime Intelligence

Instrumenting applications with structured logging to enable correlation across distributed transactions.
Configuring distributed tracing with context propagation to identify latency bottlenecks in service chains.
Designing custom metrics and dashboards that reflect business KPIs, not just infrastructure health.
Setting meaningful alert thresholds using SLOs and error budgets to reduce alert fatigue.
Implementing log sampling strategies to control costs in high-throughput systems.
Integrating observability data with incident response tools to streamline root cause analysis.
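
The error budget arithmetic behind SLO-based alerting can be shown with a short sketch. It assumes a request-based availability SLO; `error_budget_remaining` and the sample numbers are illustrative, not taken from the curriculum.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the required success ratio, for example 0.999.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests else 1.0
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
```

Here roughly 60% of the budget remains. Alerting on how fast the budget is being consumed, rather than on every individual error, is one way to reduce alert fatigue.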

Module 5: Resilience and Fault Tolerance Engineering

Implementing circuit breakers and bulkheads to prevent cascading failures during downstream outages.
Designing retry strategies with exponential backoff and jitter to avoid thundering herd problems.
Conducting regular chaos engineering experiments to validate recovery procedures in production-like environments.
Ensuring stateful services use persistent storage with appropriate backup and restore workflows.
Using health checks (liveness and readiness probes) to guide Kubernetes restart and routing decisions.
Planning for region failover by replicating critical services and data with acceptable RPO and RTO.
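
The retry item above can be sketched with exponential backoff and "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. This is one common variant, shown under assumed defaults; `backoff_delay`, `call_with_retries`, and all parameter values are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles on each attempt up to `cap`; the random factor
    spreads clients out so they do not retry in lockstep.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep):
    """Call operation(), retrying on any exception up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(backoff_delay(attempt, base, cap))
```

A real client would usually retry only on errors known to be transient and only for idempotent operations. Catching every exception here keeps the sketch short.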

Module 6: Security and Compliance in Distributed Systems

Enforcing zero-trust network policies using service mesh or CNI plugins to restrict inter-service communication.
Managing identity and access for workloads using short-lived tokens and workload identity federation.
Implementing pod security standards through admission controllers to prevent privilege escalation.
Conducting regular compliance audits of container images and runtime configurations using automated tooling.
Encrypting data in transit with mTLS between services in the mesh and TLS at external endpoints.
Integrating security posture management tools to detect misconfigurations in Kubernetes manifests.

Module 7: Platform Engineering and Internal Developer Platforms

Defining standardized deployment templates to reduce configuration drift and onboarding time.
Building self-service portals for environment provisioning with guardrails for cost and compliance.
Integrating feedback loops from operations into developer workflows via embedded observability.
Managing API documentation and contract testing to ensure backward compatibility across teams.
Operating a catalog of reusable components and managed services to reduce duplication.
Measuring platform effectiveness using DORA metrics without incentivizing gaming of the system.
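
Three of the four DORA metrics can be computed from plain deployment records, as the sketch below suggests. The `Deployment` record shape and `dora_summary` are assumptions made for illustration; time to restore service is omitted because it needs incident data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_incident: bool    # whether it triggered a failure in production


def dora_summary(deployments, window_days):
    """Summarize deployment frequency, lead time, and change failure rate."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = sum(d.caused_incident for d in deployments)
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": failures / len(deployments),
    }
```

Reporting these per platform rather than per individual is one way to lower the incentive to game the numbers.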

Module 8: Cost Management and Resource Optimization

Right-sizing container resources using historical usage data and the Vertical Pod Autoscaler (VPA).
Implementing spot instance usage with workload tolerance for interruptions and fallback strategies.
Tagging cloud resources by team, project, and environment to enable cost allocation and accountability.
Automating scale-to-zero for non-production workloads during off-hours to reduce spend.
Monitoring egress costs and optimizing data transfer patterns between regions and services.
Conducting regular cost reviews with engineering teams to align technical decisions with budget constraints.
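
Right-sizing from historical usage can be approximated by taking a high percentile of observed usage and adding headroom. The sketch below is an assumption-laden illustration, not the Vertical Pod Autoscaler's algorithm; the 95th percentile, the 15% headroom, and both function names are hypothetical choices.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_cpu_request(usage_millicores, pct=95, headroom=1.15):
    """Suggest a CPU request in millicores from observed usage samples."""
    return math.ceil(percentile(usage_millicores, pct) * headroom)


# Twenty hypothetical usage samples: 10m, 20m, ..., 200m.
samples = list(range(10, 210, 10))
suggested = recommend_cpu_request(samples)
```

For these samples the 95th percentile is 190m, so the suggested request is 219m. A request sized to the single observed peak would be higher, and one sized to the average would risk throttling under load.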