This curriculum matches the technical and operational depth of a multi-workshop platform engineering engagement. It addresses the infrastructure automation, compliance, and operational feedback challenges that internal DevOps teams face when managing large-scale, regulated cloud environments.
Module 1: Strategic Infrastructure Standardization
- Selecting between immutable and mutable infrastructure patterns based on application lifecycle requirements and rollback frequency.
- Defining naming conventions and tagging strategies across cloud providers to support cost allocation and security policy enforcement.
- Choosing configuration drift detection mechanisms and response protocols for production environments under compliance mandates.
- Implementing baseline image management using Packer with automated vulnerability scanning and patching SLAs.
- Evaluating the trade-offs between shared service models and team-owned infrastructure tooling in multi-tenant platforms.
- Establishing version control workflows for infrastructure code, including merge approval requirements and drift reconciliation procedures.
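To make the naming and tagging item in this module concrete, the following is a minimal Python sketch of a check that could run before provisioning. The `<team>-<env>-<name>` pattern, the three environment codes, and the required tag keys are assumptions chosen for illustration, not conventions prescribed by the curriculum.

```python
import re

# Hypothetical convention: <team>-<env>-<name>, lowercase, env limited to three values.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*-(dev|stg|prod)-[a-z0-9][a-z0-9-]*$")

# Hypothetical tag keys needed for cost allocation and policy enforcement.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}


def validate_resource(name: str, tags: dict) -> list:
    """Return a list of human-readable violations; an empty list means compliant."""
    violations = []
    if not NAME_PATTERN.match(name):
        violations.append(f"name '{name}' does not match <team>-<env>-<name>")
    present = set(tags)
    missing = sorted(REQUIRED_TAGS - present)
    if missing:
        violations.append(f"missing tags: {', '.join(missing)}")
    # A tag that exists but is blank is as useless for cost allocation as a missing one.
    empty = sorted(k for k in REQUIRED_TAGS & present if not str(tags[k]).strip())
    if empty:
        violations.append(f"empty tag values: {', '.join(empty)}")
    return violations
```

Returning every violation at once, rather than failing on the first, lets a pipeline report the full list in a single run.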
Module 2: Cloud-Agnostic Provisioning Design
- Mapping provider-specific services (e.g., AWS Lambda, Azure Functions) to abstracted interfaces in IaC templates for portability.
- Designing modular Terraform components with provider aliases to support multi-cloud staging environments.
- Managing state file locking and backend configuration in distributed teams using S3 with DynamoDB or Terraform Cloud.
- Implementing conditional resource creation based on environment tags without introducing configuration sprawl.
- Handling secrets injection during provisioning using external vault integration versus cloud-native secret managers.
- Validating infrastructure plans through automated policy-as-code checks using Open Policy Agent or HashiCorp Sentinel.
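The plan validation item above can be illustrated without a policy engine. The sketch below is a plain-Python stand-in, not Open Policy Agent (Rego) or Sentinel code. It reads the JSON produced by `terraform show -json <planfile>` and flags destructive changes to protected resource types. The `PROTECTED_TYPES` set and the function names are hypothetical.

```python
import json

# Hypothetical policy input: resource types a pipeline run must never destroy.
PROTECTED_TYPES = {"aws_db_instance", "aws_s3_bucket"}


def find_violations(plan: dict) -> list:
    """Scan a Terraform JSON plan for deletions of protected resource types."""
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement lists both "delete" and "create", so it is caught here too.
        if "delete" in actions and rc.get("type") in PROTECTED_TYPES:
            violations.append(f"{rc.get('address')}: destructive change to protected type")
    return violations


def check_plan_file(path: str) -> list:
    """Load the output of `terraform show -json <planfile>` and evaluate it."""
    with open(path, encoding="utf-8") as fh:
        return find_violations(json.load(fh))
```

In a pull-request pipeline, a non-empty result would typically fail the job before any apply step is reachable.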
Module 3: Continuous Infrastructure Delivery
- Configuring CI pipelines to perform plan generation and policy validation without applying changes in pull requests.
- Orchestrating canary infrastructure rollouts for network or database tier changes using automated traffic shifting.
- Integrating infrastructure tests using Terratest to verify resource attributes and connectivity post-deployment.
- Managing dependency chains between interdependent infrastructure components across service boundaries.
- Implementing automated rollback triggers based on CloudWatch, Prometheus, or custom health signals.
- Securing pipeline access with short-lived credentials and just-in-time provisioning for production environments.
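For the automated rollback item in this module, one possible decision rule is sketched below in Python. It assumes the health signal is an error rate already fetched from a monitoring system such as CloudWatch or Prometheus. The function name, the tolerance, and the sample count are illustrative assumptions.

```python
from statistics import mean


def should_roll_back(error_rates: list, baseline: float,
                     tolerance: float = 0.02, min_samples: int = 3) -> bool:
    """Decide whether post-deployment error rates justify a rollback.

    error_rates: samples in chronological order (fraction of failed requests, 0.0-1.0).
    baseline: error rate observed before the change was applied.
    """
    if len(error_rates) < min_samples:
        return False  # not enough evidence yet; keep observing
    recent = error_rates[-min_samples:]
    # Averaging the most recent samples avoids reacting to a single noisy reading.
    return mean(recent) > baseline + tolerance
```

Comparing against the pre-change baseline rather than a fixed limit keeps the rule usable for services with very different normal error rates.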
Module 4: Observability Integration for Infrastructure
- Instrumenting infrastructure components with structured logging and metric exporters for centralized collection.
- Correlating infrastructure events (e.g., autoscaling actions) with application performance metrics in dashboards.
- Designing alert thresholds for resource exhaustion (e.g., IP space, disk IOPS) that account for burst patterns.
- Implementing synthetic transaction monitoring to validate infrastructure-level connectivity and DNS resolution.
- Managing log retention policies across environments to balance cost, compliance, and debugging utility.
- Enabling distributed tracing for infrastructure-mediated calls (e.g., API gateways, service meshes).
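The alert threshold item above mentions burst patterns. One common way to account for them is to alert only on a sustained breach, as in this minimal Python sketch. The function name and the idea of a fixed evaluation window are assumptions for illustration.

```python
def sustained_breach(samples: list, threshold: float, window: int) -> bool:
    """Return True only if the last `window` samples all exceed `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # too little data to call the breach sustained
    return all(s > threshold for s in samples[-window:])
```

With a window of three, a single spike in disk IOPS does not page anyone, while three consecutive readings above the threshold do.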
Module 5: Security and Compliance Automation
- Embedding CIS benchmark checks into CI/CD pipelines using tools like Checkov or Terrascan.
- Automating certificate rotation for load balancers and internal services using private CAs and scheduled jobs.
- Enforcing network segmentation by automatically analyzing VPC flow logs and updating network policies when unexpected traffic appears.
- Managing least privilege in multi-account cloud environments by combining IAM roles with service control policies inherited through the organization hierarchy.
- Implementing just-enough access for infrastructure operators using temporary, narrowly scoped role assumption.
- Generating compliance evidence packages from infrastructure state and audit logs for external review cycles.
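As an illustration of the certificate rotation item in this module, a scheduled job could start by selecting the certificates that fall inside a renewal window. The Python sketch below assumes expiry timestamps have already been collected as timezone-aware `datetime` values. The 30-day window and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def certificates_due(expiry_by_name: dict,
                     now: Optional[datetime] = None,
                     renew_before: timedelta = timedelta(days=30)) -> list:
    """Return names of certificates that expire within `renew_before` or have already expired.

    expiry_by_name maps a certificate name to its timezone-aware notAfter timestamp.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, not_after in expiry_by_name.items()
        if not_after - now <= renew_before
    )
```

Passing `now` explicitly keeps the selection logic testable. The actual reissue step against a private CA is deliberately left out of this sketch.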
Module 6: Scalability and Resilience Engineering
- Designing autoscaling groups with predictive and reactive scaling policies based on historical load patterns.
- Implementing multi-AZ and multi-region failover for stateful services with data replication SLAs.
- Testing infrastructure resilience using controlled chaos engineering experiments in staging environments.
- Managing stateful workloads (e.g., databases) with automated backup, restore, and point-in-time recovery workflows.
- Optimizing cold start behavior for serverless infrastructure through provisioned concurrency and pre-warming.
- Validating disaster recovery runbooks with automated execution simulations and RTO/RPO tracking.
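The RTO/RPO tracking mentioned in the last item can be reduced to simple timestamp arithmetic once a drill has been run. The Python sketch below assumes three timestamps are recorded per drill. The `DrillResult` fields and the function name are illustrative, not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime      # when the simulated failure began
    service_restored: datetime  # when the service passed health checks again
    last_good_backup: datetime  # timestamp of the newest recoverable data


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare achieved recovery time and data-loss window against their targets."""
    achieved_rto = result.service_restored - result.outage_start
    achieved_rpo = result.outage_start - result.last_good_backup
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }
```

Storing the achieved values, not just pass or fail, makes it possible to track whether recovery performance is trending toward or away from the targets across drills.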
Module 7: Cost Governance and Optimization
- Allocating cloud spend to business units using granular tagging and cost allocation reports.
- Automating resource scheduling for non-production environments using start/stop policies by team.
- Right-sizing compute instances based on utilization telemetry and performance baselines.
- Implementing commitment tracking for reserved instances and savings plans across hybrid environments.
- Flagging underutilized resources (e.g., idle load balancers, unattached disks) with automated remediation workflows.
- Integrating cost impact analysis into pull requests using tools like Infracost or CloudHealth APIs.
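For the right-sizing and underutilization items in this module, a first-pass filter over utilization telemetry might look like the Python sketch below. The CPU thresholds, the input shape (instance ID mapped to percentage samples), and the function name are assumptions. A real workflow would also consider memory, network, and performance baselines.

```python
def flag_underutilized(cpu_by_instance: dict,
                       max_avg: float = 10.0, max_peak: float = 40.0) -> list:
    """Return instance IDs whose average and peak CPU (%) both stay below the limits."""
    flagged = []
    for instance_id, samples in cpu_by_instance.items():
        if not samples:
            continue  # no telemetry: do not guess
        average = sum(samples) / len(samples)
        # Checking the peak as well avoids flagging bursty but otherwise idle workloads.
        if average < max_avg and max(samples) < max_peak:
            flagged.append(instance_id)
    return sorted(flagged)
```

Instances without telemetry are skipped rather than flagged, so a monitoring gap does not turn into an automated shutdown.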
Module 8: Platform Team Operations and Feedback Loops
- Managing self-service infrastructure portals with guardrails while accommodating edge use cases.
- Collecting and prioritizing infrastructure feature requests from development teams using issue triage workflows.
- Operating internal SLAs for infrastructure provisioning and incident response with public status dashboards.
- Conducting blameless postmortems for infrastructure outages with action item tracking.
- Rotating platform engineers through on-call duties to maintain operational empathy and system familiarity.
- Measuring platform adoption and usability through deployment frequency and mean time to provision metrics.