This curriculum matches the technical and operational depth of a multi-workshop cloud transformation program. It addresses the architectural, security, and platform decisions that teams face during large-scale internal developer platform rollouts and cross-team service modernization efforts.
Module 1: Architectural Foundations of Cloud Native Systems
- Choosing between decomposing an existing monolith and building greenfield microservices based on team maturity and system criticality.
- Defining service boundaries using domain-driven design (DDD) to align with business capabilities and reduce coupling.
- Implementing API gateways to manage routing, authentication, and rate limiting across heterogeneous backend services.
- Choosing synchronous (REST/gRPC) versus asynchronous (message queues) communication based on latency and reliability requirements.
- Evaluating the operational overhead of service mesh adoption for observability, security, and traffic control.
- Designing for disposability by ensuring services start quickly and terminate gracefully to support autoscaling and rolling updates.
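The rate-limiting concern in the API gateway bullet can be illustrated with a token bucket, one common approach among several. This is a minimal sketch and not any particular gateway's implementation; the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time


class TokenBucket:
    """Illustrative token bucket: each request spends tokens, which refill over time."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # maximum burst size
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.clock = clock                     # injectable so tests can control time
        self.tokens = self.capacity            # start with a full bucket
        self.last_refill = self.clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, otherwise False."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A gateway would typically keep one bucket per client or API key and reject a request (for example with HTTP 429) when `allow()` returns False.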
Module 2: Containerization and Orchestration at Scale
- Standardizing container images using base image policies and vulnerability scanning in CI pipelines.
- Configuring Kubernetes resource requests and limits to balance performance, cost, and cluster utilization.
- Implementing pod disruption budgets to maintain availability during node maintenance or cluster upgrades.
- Managing configuration with Kubernetes ConfigMaps and secrets with external secret managers such as HashiCorp Vault.
- Designing multi-tenant namespaces with network policies and resource quotas to isolate workloads.
- Automating cluster lifecycle management using infrastructure-as-code tools like Terraform or Pulumi.
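The arithmetic behind the pod disruption budget bullet can be sketched as follows. This is a simplified model that covers only `minAvailable` and assumes percentages round up to the next whole pod; the function names are hypothetical, and the real controller also handles `maxUnavailable` and pod health details that are omitted here.

```python
def required_available(min_available, expected_pods):
    """Resolve minAvailable (an int or a string like "50%") to a pod count."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        # Integer ceiling division: round partial pods up.
        return -(-expected_pods * percent // 100)
    return int(min_available)


def allowed_disruptions(min_available, expected_pods, healthy_pods):
    """How many healthy pods could be evicted without violating the budget."""
    required = required_available(min_available, expected_pods)
    return max(0, healthy_pods - required)
```

For example, with 7 expected pods and `minAvailable` of "50%", 4 pods must stay available, so at most 3 voluntary evictions can proceed while all 7 are healthy.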
Module 3: Continuous Delivery and GitOps Practices
- Structuring Git repository layouts (mono-repo vs. multi-repo) to support independent service deployment and team autonomy.
- Implementing canary deployments with traffic shifting using service mesh or ingress controllers.
- Enforcing deployment approvals and rollback mechanisms for production environments via policy engines.
- Integrating security scanning (SAST, DAST, SCA) into CI pipelines without introducing unacceptable delays.
- Using GitOps tools like Argo CD to reconcile desired state and detect configuration drift in clusters.
- Managing Helm chart versioning and dependency updates across multiple environments and teams.
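The idea of reconciling desired state and detecting drift can be illustrated with a small comparison of two nested dictionaries. This is a sketch of the concept only, not how Argo CD computes diffs; `find_drift` is a hypothetical helper, and it deliberately ignores fields that exist only in the live object (such as status populated by the cluster).

```python
def find_drift(desired, live, path=""):
    """Return (path, desired_value, live_value) for each declared field that differs."""
    drift = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        if key not in live:
            # Declared in Git but missing from the cluster.
            drift.append((here, want, None))
        elif isinstance(want, dict) and isinstance(live[key], dict):
            # Recurse into nested objects.
            drift.extend(find_drift(want, live[key], here))
        elif want != live[key]:
            drift.append((here, want, live[key]))
    return drift
```

An empty result means the live object matches everything declared in the repository; a non-empty result is what a reconciler would either report or correct.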
Module 4: Observability and Runtime Intelligence
- Instrumenting applications with structured logging to enable correlation across distributed transactions.
- Configuring distributed tracing with context propagation to identify latency bottlenecks in service chains.
- Designing custom metrics and dashboards that reflect business KPIs, not just infrastructure health.
- Setting meaningful alert thresholds using SLOs and error budgets to reduce alert fatigue.
- Implementing log sampling strategies to control costs in high-throughput systems.
- Integrating observability data with incident response tools to streamline root cause analysis.
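The relationship between SLOs, error budgets, and alert thresholds can be made concrete with a burn-rate calculation. This is a minimal sketch; the function names and the `threshold` parameter are illustrative assumptions, and real alerting usually evaluates burn rate over more than one time window.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. roughly 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed, total, slo_target):
    """Observed error rate divided by the budgeted error rate.

    A value of 1.0 means the budget is being spent exactly on schedule.
    """
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def should_page(failed, total, slo_target, threshold):
    """Alert only when the budget is burning faster than the chosen threshold."""
    return burn_rate(failed, total, slo_target) >= threshold
```

Alerting on burn rate rather than on raw error counts ties each page to a real risk of missing the SLO, which is one way to reduce alert fatigue.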
Module 5: Resilience and Fault Tolerance Engineering
- Implementing circuit breakers and bulkheads to prevent cascading failures during downstream outages.
- Designing retry strategies with exponential backoff and jitter to avoid thundering herd problems.
- Conducting regular chaos engineering experiments to validate recovery procedures in production-like environments.
- Ensuring stateful services use persistent storage with appropriate backup and restore workflows.
- Using health checks (liveness and readiness probes) to guide Kubernetes restart and routing decisions.
- Planning for region failover by replicating critical services and data with acceptable RPO and RTO.
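The retry bullet above can be sketched as exponential backoff with "full jitter", where each delay is drawn at random between zero and an exponentially growing ceiling. The function names, defaults, and the injectable `sleep` and `rng` parameters are assumptions made for illustration and testability.

```python
import random
import time


def backoff_delay(attempt, base, cap, rng=random.random):
    """Random delay in [0, min(cap, base * 2**attempt)) for a zero-based attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on the given exceptions with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap, rng))
```

The randomness spreads retries from many clients over time, so a recovering dependency is not hit by synchronized waves of requests. In practice `retry_on` should be narrowed to errors that are safe to retry.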
Module 6: Security and Compliance in Distributed Systems
- Enforcing zero-trust network policies using service mesh or CNI plugins to restrict inter-service communication.
- Managing identity and access for workloads using short-lived tokens and workload identity federation.
- Implementing pod security standards through admission controllers to prevent privilege escalation.
- Conducting regular compliance audits of container images and runtime configurations using automated tooling.
- Encrypting data in transit with mTLS inside the service mesh and TLS (or mTLS where clients support it) at external endpoints.
- Integrating security posture management tools to detect misconfigurations in Kubernetes manifests.
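The kind of check an admission controller or manifest scanner performs can be illustrated with a few rules over a pod manifest loaded as a dictionary. This is a deliberately small sketch covering three fields (`hostNetwork`, `privileged`, `allowPrivilegeEscalation`); `find_violations` is a hypothetical helper and is not a complete implementation of the Kubernetes Pod Security Standards.

```python
def find_violations(pod):
    """Return human-readable violations for a pod manifest given as a dict."""
    spec = pod.get("spec", {})
    violations = []
    if spec.get("hostNetwork"):
        violations.append("spec.hostNetwork must not be true")
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            violations.append(f"container {name}: privileged must not be true")
        # Treat an unset value as a violation: require an explicit false.
        if ctx.get("allowPrivilegeEscalation") is not False:
            violations.append(
                f"container {name}: allowPrivilegeEscalation must be false"
            )
    return violations
```

A policy engine would reject the request when the list is non-empty; the same function could run in CI against rendered manifests before they reach the cluster.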
Module 7: Platform Engineering and Internal Developer Platforms
- Defining standardized deployment templates to reduce configuration drift and onboarding time.
- Building self-service portals for environment provisioning with guardrails for cost and compliance.
- Integrating feedback loops from operations into developer workflows via embedded observability.
- Managing API documentation and contract testing to ensure backward compatibility across teams.
- Operating a catalog of reusable components and managed services to reduce duplication.
- Measuring platform effectiveness using DORA metrics without incentivizing gaming of the system.
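Three of the DORA metrics can be computed from a simple list of deployment records, as sketched below. The `Deployment` record and its fields are assumptions for illustration; real data would come from the CI/CD system and the incident tracker, and the definitions of "lead time" and "failure" need to be agreed on by the teams being measured.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_failure: bool     # whether it led to an incident or rollback


def median_lead_time(deployments):
    """Median time from commit to production."""
    if not deployments:
        raise ValueError("no deployments to measure")
    seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    return timedelta(seconds=median(seconds))


def change_failure_rate(deployments):
    """Share of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.caused_failure for d in deployments) / len(deployments)


def deployments_per_day(deployments, window_days):
    """Average deployment frequency over the observation window."""
    return len(deployments) / window_days
```

Reporting these as trends per platform or product area, rather than as targets for individual teams, is one way to limit the incentive to game them.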
Module 8: Cost Management and Resource Optimization
- Right-sizing container resources using historical usage data and Vertical Pod Autoscaler (VPA) recommendations.
- Implementing spot instance usage with workload tolerance for interruptions and fallback strategies.
- Tagging cloud resources by team, project, and environment to enable cost allocation and accountability.
- Automating scale-to-zero for non-production workloads during off-hours to reduce spend.
- Monitoring egress costs and optimizing data transfer patterns between regions and services.
- Conducting regular cost reviews with engineering teams to align technical decisions with budget constraints.
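Right-sizing from historical usage can be sketched as taking a high percentile of observed CPU usage and adding headroom. This is an illustrative heuristic, not the algorithm used by the Vertical Pod Autoscaler; the function names, the nearest-rank percentile method, and the default of the 95th percentile with 15% headroom are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_cpu_request(samples_millicores, pct=95, headroom_pct=15):
    """Suggest a CPU request in millicores from integer usage samples."""
    baseline = percentile(samples_millicores, pct)
    # Add headroom and round up to a whole millicore.
    return math.ceil(baseline * (100 + headroom_pct) / 100)
```

Comparing the recommendation with the currently configured request highlights over-provisioned workloads that are worth raising in a cost review.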