This curriculum spans the technical breadth of a multi-workshop Kubernetes adoption program, covering the cluster design, security, and operational rigor demanded by enterprise-scale DevOps transformations.
Module 1: Cluster Architecture and High Availability Design
- Selecting between self-managed control planes and managed services (e.g., EKS, GKE, AKS) based on compliance requirements and operational overhead tolerance.
- Designing etcd backup and restore procedures with regular snapshot schedules and testing recovery workflows in isolated environments.
- Distributing control plane nodes across failure domains to maintain quorum during zone outages while minimizing latency.
- Configuring kube-apiserver flags to enforce request timeouts, limit concurrent requests, and prevent denial-of-service scenarios.
- Implementing dedicated worker node pools for system-critical components (e.g., CoreDNS, CNI) to isolate resource contention.
- Planning IP address allocation for pods and services to avoid CIDR exhaustion and ensure compatibility with on-prem network ranges.
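The etcd backup practice above can be sketched as a scheduled job pinned to the control plane. This is a minimal illustration, not a production recipe: the certificate paths, backup directory, and etcd image tag are assumptions that must match your cluster's layout and etcd version.

```yaml
# Sketch: nightly etcd snapshot via a CronJob on a control-plane node.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.12-0   # match your etcd version
              command:
                - /bin/sh
                - -c
                - >
                  ETCDCTL_API=3 etcdctl snapshot save
                  /backup/etcd-$(date +%Y%m%d).db
                  --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/kubernetes/pki/etcd/ca.crt
                  --cert=/etc/kubernetes/pki/etcd/server.crt
                  --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - { name: backup, mountPath: /backup }
                - { name: pki, mountPath: /etc/kubernetes/pki/etcd, readOnly: true }
          volumes:
            - name: backup
              hostPath: { path: /var/backups/etcd, type: DirectoryOrCreate }
            - name: pki
              hostPath: { path: /etc/kubernetes/pki/etcd }
```

Pair any schedule like this with off-node copies and periodic restore drills in an isolated environment, as the module emphasizes; a snapshot that has never been restored is not a backup.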
Module 2: Networking and Service Connectivity
- Choosing a CNI plugin (Calico, Cilium, or Flannel) based on network policy enforcement needs, IPv6 support, and eBPF requirements.
- Configuring ingress controllers (NGINX, Traefik, or Istio) with TLS termination, rate limiting, and header manipulation for production workloads.
- Implementing service mesh sidecar injection selectively via namespace labels to control performance impact.
- Designing multi-cluster service discovery using DNS federation or service mesh gateways for cross-cluster communication.
- Enforcing network policies to restrict pod-to-pod traffic by namespace, label, or port, including default-deny baseline policies.
- Integrating cluster networking with existing corporate firewalls and proxy infrastructure without breaking east-west traffic.
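The default-deny baseline described above can be expressed with two small policies: one that blocks all traffic in a namespace, and one that re-opens DNS so pods can still resolve cluster services. The namespace name is illustrative.

```yaml
# Default-deny baseline: all ingress and egress blocked until explicitly allowed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Re-allow DNS egress to kube-system so service discovery keeps working.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```

From this baseline, each workload's legitimate flows are then allowed one policy at a time, which keeps east-west traffic auditable.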
Module 3: Security and Identity Management
- Configuring RBAC roles and bindings to follow least-privilege principles, including regular audit and cleanup of unused permissions.
- Integrating external identity providers (e.g., Okta, Azure AD) with kube-apiserver using OIDC for centralized user access control.
- Rotating service account tokens and kubeconfig credentials on a defined schedule using automated tooling and audit trails.
- Enabling Pod Security Admission (PSA) at the baseline or restricted level to block privileged containers and enforce runtime constraints.
- Scanning container images in CI/CD pipelines for CVEs and enforcing admission policies via OPA/Gatekeeper.
- Securing etcd encryption at rest with KMS-backed keys and restricting client access to etcd through firewall rules.
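A least-privilege RBAC pair for the scenario above might look like the following sketch: a namespaced Role granting read-only access to pods and their logs, bound to a group claim delivered by the OIDC provider. The namespace and group name are placeholders.

```yaml
# Least-privilege Role: read-only access to pods and logs in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind the Role to a group asserted by the external identity provider.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: payments
subjects:
  - kind: Group
    name: payments-oncall            # hypothetical OIDC group claim
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Binding to groups rather than individual users keeps the audit-and-cleanup cycle manageable: membership changes happen in the identity provider, not in cluster manifests.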
Module 4: Storage and Stateful Workload Management
- Selecting persistent volume types (e.g., AWS EBS, GCP PD, NFS) based on IOPS requirements, availability zones, and backup compatibility.
- Designing StatefulSets with ordered deployment and deletion for databases requiring stable network identities and storage attachments.
- Implementing CSI snapshot controllers to enable application-consistent backups and restore operations across clusters.
- Configuring dynamic provisioning with StorageClasses tailored to performance tiers and retention policies.
- Managing lifecycle of persistent volumes during cluster migration by coordinating unmount, detach, and reattach operations.
- Enforcing storage quotas per namespace to prevent runaway claims from exhausting shared storage resources.
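Two of the points above, performance-tiered StorageClasses and namespace storage quotas, can be sketched together. The EBS CSI provisioner and IOPS value are assumptions for an AWS cluster; other platforms substitute their own CSI driver and parameters.

```yaml
# Performance-tier StorageClass (AWS EBS gp3 via the CSI driver) with a
# Retain reclaim policy so released volumes survive accidental PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-retained
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"                            # provisioned IOPS; tune per workload
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # bind in the consuming pod's zone
---
# Namespace storage quota: cap total requested capacity and claim count.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: payments
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "20"
```

`WaitForFirstConsumer` matters for zonal volume types: it defers provisioning until the pod is scheduled, avoiding volumes created in a zone where no node can mount them.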
Module 5: CI/CD Integration and GitOps Workflows
- Choosing between GitOps (Argo CD, Flux) and imperative CI/CD pipelines based on auditability and drift remediation needs.
- Structuring Git repository layouts to separate environments (dev/staging/prod) with branch protection and approval workflows.
- Configuring automated canary deployments with traffic shifting using service mesh or ingress annotations.
- Implementing pre-deployment hooks for database schema migrations and post-deployment health validation checks.
- Managing Helm chart versioning and dependency updates with semantic versioning and automated testing in staging.
- Enabling rollback mechanisms through Git history or CI pipeline triggers with defined success criteria.
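A GitOps setup along these lines can be sketched as an Argo CD Application that pins the prod environment to a Git path, with automated sync, pruning, and self-heal for drift remediation. The repository URL and paths are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git   # placeholder
    targetRevision: main
    path: environments/prod/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert out-of-band changes (drift remediation)
    syncOptions:
      - CreateNamespace=true
```

With this shape, rollback is a Git revert: the controller converges the cluster back to the previous commit, which is the auditability advantage the module contrasts against imperative pipelines.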
Module 6: Observability and Runtime Monitoring
- Deploying Prometheus with federation or sharding strategies to handle high-cardinality metrics in large clusters.
- Configuring liveness and readiness probes with appropriate thresholds to avoid premature restarts or traffic routing errors.
- Correlating application logs with pod metadata using structured logging and centralized collection via Fluentd or Vector.
- Setting up distributed tracing with OpenTelemetry instrumentation to diagnose latency across microservices.
- Defining SLOs and error budgets in monitoring dashboards to guide incident response and release decisions.
- Managing retention policies for metrics, logs, and traces based on compliance requirements and storage cost constraints.
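The probe-tuning point above can be illustrated with a container that separates three concerns: a startupProbe that tolerates a slow boot, a livenessProbe that restarts only on sustained failure, and a readinessProbe that pulls the pod out of Service endpoints quickly. Image name and endpoints are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  selector:
    matchLabels: { app: orders-api }
  template:
    metadata:
      labels: { app: orders-api }
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2   # placeholder
          ports:
            - containerPort: 8080
          startupProbe:               # allow up to 5 minutes of boot time
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:              # restart only after ~30s of failures
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:             # stop routing traffic on first failure
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 5
            failureThreshold: 1
```

Without the startupProbe, the liveness thresholds would have to be inflated to cover boot time, weakening restart detection for the rest of the pod's life.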
Module 7: Scaling, Resource Management, and Cost Optimization
- Configuring horizontal pod autoscalers with custom or external metrics beyond CPU/memory (e.g., queue depth).
- Implementing cluster autoscaler with node group constraints to balance cost and startup latency during scale events.
- Setting resource requests and limits based on historical usage data and performance testing to prevent throttling.
- Using vertical pod autoscaling cautiously in production, with recommendation-only ("Off") update mode for stateful applications to avoid restarts.
- Applying namespace-level resource quotas and limit ranges to enforce fair sharing and prevent resource monopolization.
- Conducting regular cost attribution reports using tools like Kubecost to identify underutilized nodes and idle workloads.
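An HPA driven by queue depth, as suggested above, might be sketched as follows. The external metric name assumes a metrics adapter for your message broker is installed; the metric, labels, and target value are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: rabbitmq_queue_messages_ready   # hypothetical adapter metric
          selector:
            matchLabels: { queue: payments }
        target:
          type: AverageValue
          averageValue: "100"    # aim for ~100 ready messages per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping on bursty queues
```

The `behavior` stanza is worth tuning alongside the metric: bursty queues otherwise cause rapid scale-up/scale-down cycles that waste node capacity.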
Module 8: Disaster Recovery and Multi-Cluster Operations
- Designing backup strategies for etcd and persistent volumes with geographic separation and restore validation drills.
- Implementing cluster bootstrapping automation using infrastructure-as-code (Terraform, Pulumi) for rapid recovery.
- Coordinating DNS failover and traffic routing during primary cluster outages using global load balancers.
- Replicating critical workloads across clusters using active-passive or active-active patterns with data synchronization.
- Managing configuration drift across clusters using centralized policy enforcement tools like Kyverno or OPA.
- Establishing cross-cluster logging and monitoring aggregation to maintain visibility during failover events.
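Centralized policy enforcement across clusters can be sketched with a Kyverno ClusterPolicy applied identically to every cluster; any resource that violates it surfaces as an admission failure or audit finding, making drift visible. The label requirement here is an illustrative example.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # start with Audit when rolling out
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet]
      validate:
        message: "All workloads must set the 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*"             # any non-empty value
```

Distributing the same policy set to each cluster through the GitOps pipeline from Module 5 closes the loop: configuration drift is both prevented at admission time and reconciled from Git.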