This curriculum spans the technical breadth of a multi-workshop Kubernetes adoption program, covering the cluster design, security, and operational rigor demanded by enterprise-scale DevOps transformations.
Module 1: Cluster Architecture and High Availability Design
- Selecting between self-managed control planes and managed services (e.g., EKS, GKE, AKS) based on compliance requirements and operational overhead tolerance.
- Designing etcd backup and restore procedures with regular snapshot schedules and testing recovery workflows in isolated environments.
- Distributing control plane nodes across failure domains to maintain quorum during zone outages while minimizing latency.
- Configuring kube-apiserver flags to enforce request timeouts, limit concurrent requests, and prevent denial-of-service scenarios.
- Implementing dedicated worker node pools for system-critical components (e.g., CoreDNS, CNI) to isolate resource contention.
- Planning IP address allocation for pods and services to avoid CIDR exhaustion and ensure compatibility with on-prem network ranges.
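The etcd backup practice above can be sketched as a scheduled job pinned to the control plane. This is a minimal illustration, not a production recipe: the certificate paths, backup directory, and etcd image tag are assumptions that must match your cluster's layout and etcd version.

```yaml
# Sketch: nightly etcd snapshot via a CronJob on a control-plane node.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.12-0   # match your etcd version
              command:
                - /bin/sh
                - -c
                - >
                  ETCDCTL_API=3 etcdctl snapshot save
                  /backup/etcd-$(date +%Y%m%d).db
                  --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/kubernetes/pki/etcd/ca.crt
                  --cert=/etc/kubernetes/pki/etcd/server.crt
                  --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - { name: backup, mountPath: /backup }
                - { name: pki, mountPath: /etc/kubernetes/pki/etcd, readOnly: true }
          volumes:
            - name: backup
              hostPath: { path: /var/backups/etcd, type: DirectoryOrCreate }
            - name: pki
              hostPath: { path: /etc/kubernetes/pki/etcd }
```

Pair any schedule like this with off-node copies and periodic restore drills in an isolated environment, as the module emphasizes; a snapshot that has never been restored is not a backup.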
Module 2: Networking and Service Connectivity
- Choosing a CNI plugin (Calico, Cilium, or Flannel) based on network policy enforcement needs, IPv6 support, and eBPF requirements.
- Configuring ingress controllers (NGINX, Traefik, or Istio) with TLS termination, rate limiting, and header manipulation for production workloads.
- Implementing service mesh sidecar injection selectively via namespace labels to control performance impact.
- Designing multi-cluster service discovery using DNS federation or service mesh gateways for cross-cluster communication.
- Enforcing network policies to restrict pod-to-pod traffic by namespace, label, or port, including default-deny baseline policies.
- Integrating cluster networking with existing corporate firewalls and proxy infrastructure without breaking east-west traffic.
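The default-deny baseline described above can be expressed with two small policies: one that blocks all traffic in a namespace, and one that re-opens DNS so pods can still resolve cluster services. The namespace name is illustrative.

```yaml
# Default-deny baseline: all ingress and egress blocked until explicitly allowed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Re-allow DNS egress to kube-system so service discovery keeps working.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```

From this baseline, each workload's legitimate flows are then allowed one policy at a time, which keeps east-west traffic auditable.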
Module 3: Security and Identity Management
- Configuring RBAC roles and bindings to follow least-privilege principles, including regular audit and cleanup of unused permissions.
- Integrating external identity providers (e.g., Okta, Azure AD) with kube-apiserver using OIDC for centralized user access control.
- Rotating service account tokens and kubeconfig credentials on a defined schedule using automated tooling and audit trails.
- Enabling Pod Security Admission (PSA) at the baseline or restricted level to block privileged containers and enforce runtime constraints.
- Scanning container images in CI/CD pipelines for CVEs and enforcing admission policies via OPA/Gatekeeper.
- Securing etcd encryption at rest with KMS-backed keys and restricting client access to etcd through firewall rules.
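A least-privilege RBAC pair for the scenario above might look like the following sketch: a namespaced Role granting read-only access to pods and their logs, bound to a group claim delivered by the OIDC provider. The namespace and group name are placeholders.

```yaml
# Least-privilege Role: read-only access to pods and logs in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind the Role to a group asserted by the external identity provider.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: payments
subjects:
  - kind: Group
    name: payments-oncall            # hypothetical OIDC group claim
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Binding to groups rather than individual users keeps the audit-and-cleanup cycle manageable: membership changes happen in the identity provider, not in cluster manifests.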
Module 4: Storage and Stateful Workload Management
- Selecting persistent volume types (e.g., AWS EBS, GCP PD, NFS) based on IOPS requirements, availability zones, and backup compatibility.
- Designing StatefulSets with ordered deployment and deletion for databases requiring stable network identities and storage attachments.
- Implementing CSI snapshot controllers to enable application-consistent backups and restore operations across clusters.
- Configuring dynamic provisioning with StorageClasses tailored to performance tiers and retention policies.
- Managing lifecycle of persistent volumes during cluster migration by coordinating unmount, detach, and reattach operations.
- Enforcing storage quotas per namespace to prevent runaway claims from exhausting shared storage resources.
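Two of the points above, performance-tiered StorageClasses and namespace storage quotas, can be sketched together. The EBS CSI provisioner and IOPS value are assumptions for an AWS cluster; other platforms substitute their own CSI driver and parameters.

```yaml
# Performance-tier StorageClass (AWS EBS gp3 via the CSI driver) with a
# Retain reclaim policy so released volumes survive accidental PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-retained
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"                            # provisioned IOPS; tune per workload
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # bind in the consuming pod's zone
---
# Namespace storage quota: cap total requested capacity and claim count.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: payments
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "20"
```

`WaitForFirstConsumer` matters for zonal volume types: it defers provisioning until the pod is scheduled, avoiding volumes created in a zone where no node can mount them.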
Module 5: CI/CD Integration and GitOps Workflows
- Choosing between GitOps (Argo CD, Flux) and imperative CI/CD pipelines based on auditability and drift remediation needs.
- Structuring Git repository layouts to separate environments (dev/staging/prod) with branch protection and approval workflows.
- Configuring automated canary deployments with traffic shifting using service mesh or ingress annotations.
- Implementing pre-deployment hooks for database schema migrations and post-deployment health validation checks.
- Managing Helm chart versioning and dependency updates with semantic versioning and automated testing in staging.
- Enabling rollback mechanisms through Git history or CI pipeline triggers with defined success criteria.
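A GitOps setup along these lines can be sketched as an Argo CD Application that pins the prod environment to a Git path, with automated sync, pruning, and self-heal for drift remediation. The repository URL and paths are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git   # placeholder
    targetRevision: main
    path: environments/prod/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert out-of-band changes (drift remediation)
    syncOptions:
      - CreateNamespace=true
```

With this shape, rollback is a Git revert: the controller converges the cluster back to the previous commit, which is the auditability advantage the module contrasts against imperative pipelines.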
Module 6: Observability and Runtime Monitoring
- Deploying Prometheus with federation or sharding strategies to handle high-cardinality metrics in large clusters.
- Configuring liveness and readiness probes with appropriate thresholds to avoid premature restarts or traffic routing errors.
- Correlating application logs with pod metadata using structured logging and centralized collection via Fluentd or Vector.
- Setting up distributed tracing with OpenTelemetry instrumentation to diagnose latency across microservices.
- Defining SLOs and error budgets in monitoring dashboards to guide incident response and release decisions.
- Managing retention policies for metrics, logs, and traces based on compliance requirements and storage cost constraints.
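The probe-tuning point above can be illustrated with a container that separates three concerns: a startupProbe that tolerates a slow boot, a livenessProbe that restarts only on sustained failure, and a readinessProbe that pulls the pod out of Service endpoints quickly. Image name and endpoints are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  selector:
    matchLabels: { app: orders-api }
  template:
    metadata:
      labels: { app: orders-api }
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2   # placeholder
          ports:
            - containerPort: 8080
          startupProbe:               # allow up to 5 minutes of boot time
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:              # restart only after ~30s of failures
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:             # stop routing traffic on first failure
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 5
            failureThreshold: 1
```

Without the startupProbe, the liveness thresholds would have to be inflated to cover boot time, weakening restart detection for the rest of the pod's life.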
Module 7: Scaling, Resource Management, and Cost Optimization
- Configuring horizontal pod autoscalers with custom or external metrics beyond CPU/memory (e.g., queue depth).
- Implementing cluster autoscaler with node group constraints to balance cost and startup latency during scale events.
- Setting resource requests and limits based on historical usage data and performance testing to prevent throttling.
- Using vertical pod autoscaling cautiously in production, with recommendation-only ("Off") update mode for stateful applications to avoid restarts.
- Applying namespace-level resource quotas and limit ranges to enforce fair sharing and prevent resource monopolization.
- Conducting regular cost attribution reports using tools like Kubecost to identify underutilized nodes and idle workloads.
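An HPA driven by queue depth, as suggested above, might be sketched as follows. The external metric name assumes a metrics adapter for your message broker is installed; the metric, labels, and target value are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: rabbitmq_queue_messages_ready   # hypothetical adapter metric
          selector:
            matchLabels: { queue: payments }
        target:
          type: AverageValue
          averageValue: "100"    # aim for ~100 ready messages per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping on bursty queues
```

The `behavior` stanza is worth tuning alongside the metric: bursty queues otherwise cause rapid scale-up/scale-down cycles that waste node capacity.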
Module 8: Disaster Recovery and Multi-Cluster Operations
- Designing backup strategies for etcd and persistent volumes with geographic separation and restore validation drills.
- Implementing cluster bootstrapping automation using infrastructure-as-code (Terraform, Pulumi) for rapid recovery.
- Coordinating DNS failover and traffic routing during primary cluster outages using global load balancers.
- Replicating critical workloads across clusters using active-passive or active-active patterns with data synchronization.
- Managing configuration drift across clusters using centralized policy enforcement tools like Kyverno or OPA.
- Establishing cross-cluster logging and monitoring aggregation to maintain visibility during failover events.
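Centralized policy enforcement across clusters can be sketched with a Kyverno ClusterPolicy applied identically to every cluster; any resource that violates it surfaces as an admission failure or audit finding, making drift visible. The label requirement here is an illustrative example.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # start with Audit when rolling out
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet]
      validate:
        message: "All workloads must set the 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*"             # any non-empty value
```

Distributing the same policy set to each cluster through the GitOps pipeline from Module 5 closes the loop: configuration drift is both prevented at admission time and reconciled from Git.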