This curriculum covers the infrastructure automation, security, and operational patterns that teams encounter when establishing cloud-native practices across distributed systems, at the technical and operational depth of a multi-workshop cloud adoption program.
Module 1: Cloud Infrastructure as Code (IaC) Strategy and Implementation
- Select between Terraform and AWS CloudFormation based on multi-cloud requirements, team expertise, and state management complexity.
- Design reusable IaC modules with parameterized inputs to enforce consistency across development, staging, and production environments.
- Implement state file backend configuration using S3 with DynamoDB locking to prevent concurrent modification conflicts.
- Enforce IaC peer review workflows within CI/CD pipelines to prevent unauthorized infrastructure changes.
- Balance drift detection frequency with operational overhead by scheduling periodic audits without blocking deployment velocity.
- Integrate IaC scanning tools (e.g., Checkov or tfsec) into pre-commit hooks to catch policy violations early.
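The remote state item above can be sketched as a small generator that renders a Terraform JSON-syntax backend block per environment. This is a minimal illustration, not a prescribed layout: the bucket name `example-tf-state`, the lock table `example-tf-locks`, the region, and the per-environment key scheme are all hypothetical, and it assumes the S3 backend's `dynamodb_table` argument is the locking mechanism in use.

```python
import json

ENVIRONMENTS = ("dev", "staging", "prod")


def backend_config(environment: str,
                   bucket: str = "example-tf-state",      # hypothetical bucket
                   lock_table: str = "example-tf-locks",  # hypothetical table
                   region: str = "us-east-1") -> dict:
    """Return a Terraform JSON-syntax S3 backend block for one environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": bucket,
                    # One state object per environment keeps changes isolated.
                    "key": f"{environment}/terraform.tfstate",
                    "region": region,
                    # Lock table that guards against concurrent state writes.
                    "dynamodb_table": lock_table,
                    "encrypt": True,
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(backend_config("staging"), indent=2))
```

The printed JSON could be saved as a `.tf.json` file next to the module's other configuration; keeping the environment in the state key is one possible way to stop development and production from sharing a state object.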
Module 2: Secure Cloud Identity and Access Management (IAM)
- Define least-privilege IAM roles for EC2 instances and attach them through instance profiles rather than distributing long-term access keys.
- Implement cross-account access using AWS Organizations and IAM roles instead of shared credentials.
- Rotate service account keys automatically using cloud-native secret rotation (e.g., AWS Secrets Manager with Lambda).
- Enforce multi-factor authentication (MFA) for all human users, including federated identities from corporate directories.
- Map external identity providers (IdP) via SAML 2.0 for centralized access control without duplicating user management.
- Monitor IAM activity using CloudTrail and configure alerts for high-risk actions like root account usage or policy detachment.
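To make the least-privilege item concrete, the sketch below builds a narrowly scoped IAM policy document and runs a simple lint pass that flags wildcard actions or resources. The bucket name, the `Sid`, and the helper names `s3_read_only_policy` and `wildcard_findings` are hypothetical, and the check is deliberately simplistic; it is not a substitute for a real policy analyzer.

```python
def s3_read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Build a policy that only allows reading objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


def _as_list(value):
    # IAM allows a single string or a list for Action/Resource/Statement.
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy: dict) -> list:
    """Return human-readable findings for overly broad Allow statements."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {index}: wildcard action")
        if "*" in resources:
            findings.append(f"statement {index}: wildcard resource")
    return findings


if __name__ == "__main__":
    admin = {"Version": "2012-10-17",
             "Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"}}
    print(wildcard_findings(s3_read_only_policy("example-bucket", "reports/")))
    print(wildcard_findings(admin))
```

A check like this could run in the same review workflow as other policy scans, so that broad grants are discussed before they reach an account.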
Module 3: Cloud-Native CI/CD Pipeline Architecture
- Choose between managed build services (e.g., AWS CodeBuild, GitHub Actions) and self-hosted runners based on compliance and artifact handling needs.
- Design pipeline stages to include infrastructure provisioning, integration testing, and security scanning before production promotion.
- Cache build dependencies in private artifact repositories to reduce pipeline execution time and external dependencies.
- Implement canary deployments using cloud load balancers and feature flags to limit blast radius during rollouts.
- Enforce pipeline immutability by signing and versioning pipeline definitions in source control.
- Integrate deployment rollback triggers based on CloudWatch alarms or synthetic health checks.
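The canary and rollback items above amount to a small decision rule: shift more traffic while the canary looks healthy, and drop it to zero when it does not. The following is a sketch under stated assumptions; the step schedule, the error-rate tolerance, the minimum sample size, and the names `CanaryMetrics` and `next_weight` are hypothetical, and a real pipeline would read these numbers from its monitoring system rather than take them as arguments.

```python
from dataclasses import dataclass

CANARY_STEPS = (5, 25, 50, 100)  # percent of traffic; hypothetical schedule


@dataclass(frozen=True)
class CanaryMetrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def next_weight(current: int,
                canary: CanaryMetrics,
                baseline: CanaryMetrics,
                tolerance: float = 0.01,
                min_requests: int = 100) -> int:
    """Return the next canary traffic weight.

    0 means roll back, `current` means hold, a larger value means promote.
    """
    if canary.requests < min_requests:
        return current  # too little data to judge; hold at this step
    if canary.error_rate > baseline.error_rate + tolerance:
        return 0  # canary is worse than baseline beyond tolerance
    for step in CANARY_STEPS:
        if step > current:
            return step
    return 100  # already at full traffic


if __name__ == "__main__":
    baseline = CanaryMetrics(requests=10_000, errors=50)
    print(next_weight(5, CanaryMetrics(1_000, 5), baseline))
    print(next_weight(25, CanaryMetrics(1_000, 100), baseline))
```

Comparing the canary against a live baseline, rather than a fixed threshold, is one way to keep the rule from tripping on incidents that affect both versions equally.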
Module 4: Cloud Networking and Connectivity Patterns
- Design VPC architectures with public, private, and isolated subnets aligned to application tier security requirements.
- Implement VPC peering or AWS Transit Gateway based on scalability needs and cross-account topology.
- Configure DNS resolution across VPCs using Route 53 Resolver endpoints to support internal service discovery.
- Enforce egress filtering using Security Groups, NACLs, and AWS Network Firewall based on data classification.
- Establish hybrid connectivity via AWS Direct Connect or Site-to-Site VPN with failover configurations.
- Segment workloads using micro-segmentation principles and security group referencing to limit lateral movement.
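The tiered subnet layout described in this module can be planned before any resource is created. The sketch below uses Python's standard `ipaddress` module to carve a VPC CIDR into one subnet per tier and Availability Zone; the CIDR, the AZ names, the `/20` subnet size, and the function name `plan_subnets` are assumptions chosen for illustration.

```python
import ipaddress

TIERS = ("public", "private", "isolated")


def plan_subnets(vpc_cidr: str, azs: list, new_prefix: int = 20) -> dict:
    """Assign one non-overlapping subnet to every (tier, AZ) pair."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = network.subnets(new_prefix=new_prefix)  # lazy generator
    plan = {}
    for tier in TIERS:
        for az in azs:
            try:
                plan[(tier, az)] = str(next(blocks))
            except StopIteration:
                raise ValueError(
                    "VPC CIDR is too small for the requested layout"
                ) from None
    return plan


if __name__ == "__main__":
    layout = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
    for (tier, az), cidr in layout.items():
        print(f"{tier:<9} {az}  {cidr}")
```

Allocating blocks in a fixed order leaves the remaining address space contiguous, which can make it easier to add a tier or an AZ later without renumbering.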
Module 5: Observability and Logging in Distributed Systems
- Centralize logs from EC2, containers, and serverless functions into a single structured logging platform (e.g., CloudWatch Logs with subscription filters that forward selected log events).
- Define log retention policies based on regulatory requirements and cost constraints, archiving to S3 or Glacier when needed.
- Instrument applications with distributed tracing (e.g., AWS X-Ray) to diagnose latency across microservices.
- Configure custom CloudWatch metrics and dashboards for business-critical KPIs, not just infrastructure health.
- Correlate logs, metrics, and traces using unique request IDs to reduce mean time to diagnosis (MTTD).
- Suppress alert noise by tuning thresholds using historical baselines and anomaly detection algorithms.
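Correlating signals by request ID depends on every log line carrying that ID in a machine-readable field. The sketch below shows one way to do this with Python's standard `logging` and `contextvars` modules; the logger name `example.service`, the field names, and the helpers `build_logger` and `handle_request` are hypothetical, and a real service would typically take the incoming ID from a request header.

```python
import contextvars
import json
import logging
import sys
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the request ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("example.service")  # hypothetical name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's request ID if present, otherwise mint a new one."""
    request_id = incoming_id or str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("request started")
        return request_id
    finally:
        request_id_var.reset(token)


if __name__ == "__main__":
    handle_request(build_logger(sys.stdout), "abc-123")
```

Passing the same ID to downstream calls and attaching it to traces and metric dimensions is what allows the three signal types to be joined during an investigation.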
Module 6: Cost Governance and Cloud Financial Management
- Implement tagging strategies for cost allocation, requiring tags like 'owner', 'environment', and 'application' at resource creation.
- Use AWS Cost Explorer and budgets to monitor spend anomalies and enforce accountability by team or project.
- Select between on-demand, reserved instances, and spot instances based on workload stability and fault tolerance.
- Automate shutdown of non-production resources during off-hours using scheduled Lambda functions.
- Negotiate enterprise discount plans only after establishing baseline usage and forecasting accuracy.
- Conduct monthly showback/chargeback reviews with engineering leads to align spending with value delivery.
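The tagging and off-hours shutdown items can be combined into a small policy check that a scheduled function might run. This is a sketch only: the set of non-production environment values, the 08:00 to 19:00 weekday working window, and the function names are assumptions, and the code decides what should be stopped without calling any cloud API.

```python
from datetime import datetime

REQUIRED_TAGS = ("owner", "environment", "application")
NON_PRODUCTION = {"dev", "staging"}  # hypothetical environment values


def missing_tags(tags: dict) -> list:
    """Return required tag keys that are absent or blank."""
    present = {key for key, value in tags.items() if str(value).strip()}
    return [tag for tag in REQUIRED_TAGS if tag not in present]


def is_off_hours(now: datetime, start_hour: int = 8, end_hour: int = 19) -> bool:
    """True on weekends and outside the assumed working window."""
    return now.weekday() >= 5 or not (start_hour <= now.hour < end_hour)


def should_stop(tags: dict, now: datetime) -> bool:
    """Stop only fully tagged, non-production resources during off-hours."""
    if missing_tags(tags):
        return False  # untagged resources are reported, not stopped blindly
    return tags["environment"] in NON_PRODUCTION and is_off_hours(now)


if __name__ == "__main__":
    dev = {"owner": "team-a", "environment": "dev", "application": "billing"}
    print(missing_tags({"owner": "team-a"}))
    print(should_stop(dev, datetime(2024, 1, 3, 22, 0)))
```

Refusing to act on resources with missing tags is a conservative design choice: it keeps the automation from stopping something whose environment is unknown, while the missing-tag list feeds the accountability reviews.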
Module 7: Resilience and Disaster Recovery in the Cloud
- Define recovery time objectives (RTO) and recovery point objectives (RPO) per application tier to guide backup strategies.
- Implement multi-AZ database deployments with automated failover using RDS or Aurora, avoiding single points of failure.
- Test disaster recovery runbooks quarterly by simulating region outages using controlled resource termination.
- Replicate critical data across regions using S3 Cross-Region Replication with versioning enabled.
- Use infrastructure blueprints to rebuild environments from IaC rather than relying solely on backups.
- Validate failover DNS routing using Route 53 health checks with failover or weighted routing policies.
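Per-tier RTO and RPO targets become useful once they are compared with what the backup schedule and recovery drills actually deliver. The sketch below treats the backup interval as the worst-case data loss and the drill's measured restore time as the achieved recovery time; the tier names, the target values, and the names `Objectives` and `violations` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


# Hypothetical targets per application tier.
TIER_OBJECTIVES = {
    "tier-1": Objectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    "tier-2": Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "tier-3": Objectives(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def violations(tier: str,
               backup_interval: timedelta,
               measured_restore: timedelta) -> list:
    """Compare a backup schedule and a drill result with the tier's targets."""
    target = TIER_OBJECTIVES[tier]
    problems = []
    # With periodic backups, up to one full interval of data can be lost.
    if backup_interval > target.rpo:
        problems.append(f"RPO exceeded by {backup_interval - target.rpo}")
    if measured_restore > target.rto:
        problems.append(f"RTO exceeded by {measured_restore - target.rto}")
    return problems


if __name__ == "__main__":
    print(violations("tier-2", timedelta(hours=6), timedelta(hours=2)))
```

Running a check like this after each quarterly drill gives the runbook review a concrete pass or fail per tier instead of a general impression.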
Module 8: Containerization and Orchestration at Scale
- Choose between AWS ECS and EKS based on team Kubernetes expertise and long-term platform portability goals.
- Secure container images by scanning for vulnerabilities in CI and enforcing image signing with Notary or Sigstore.
- Limit container resource usage with CPU and memory requests/limits to prevent noisy neighbor issues.
- Implement rolling update strategies with readiness and liveness probes to maintain service availability.
- Manage secrets using platform-native tools (e.g., AWS Systems Manager Parameter Store or Secrets Manager) instead of hard-coding them as plaintext environment variables.
- Monitor cluster health using control plane metrics and node termination handlers to respond to spot interruptions.
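For teams that choose EKS, the resource limit, probe, and rolling update items map onto a handful of fields in a Kubernetes Deployment. The sketch below builds such a manifest as a Python dictionary; the application name, image reference, port, probe paths (`/ready`, `/healthz`), and the CPU and memory figures are all placeholder assumptions that would need tuning for a real workload.

```python
import json


def deployment_manifest(name: str, image: str,
                        replicas: int = 3, port: int = 8080) -> dict:
    """Build an apps/v1 Deployment with limits, probes, and rolling updates."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Requests guide scheduling; limits cap usage to contain noisy neighbors.
        "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"},
        },
        # Readiness gates traffic; liveness restarts a stuck container.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
        },
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Add one new pod at a time and never drop below desired capacity.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifest = deployment_manifest("example-api", "example.registry/example-api:1.0.0")
    print(json.dumps(manifest, indent=2))
```

Setting `maxUnavailable` to 0 trades a slower rollout for steady serving capacity, which pairs naturally with a readiness probe that only passes once the new pod can take traffic.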