Description

This curriculum matches the depth of a multi-workshop platform engineering engagement. It addresses the infrastructure automation, compliance, and operational feedback challenges that internal DevOps teams face when managing large-scale, regulated cloud environments.

Module 1: Strategic Infrastructure Standardization

Selecting between immutable and mutable infrastructure patterns based on application lifecycle requirements and rollback frequency.
Defining naming conventions and tagging strategies across cloud providers to support cost allocation and security policy enforcement.
Choosing configuration drift detection mechanisms and response protocols for production environments under compliance mandates.
Implementing baseline image management using Packer with automated vulnerability scanning and patching SLAs.
Evaluating the trade-offs between shared service models and team-owned infrastructure tooling in multi-tenant platforms.
Establishing version control workflows for infrastructure code, including merge approval requirements and drift reconciliation procedures.
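
As an illustration of the drift detection topic in this module, the sketch below compares the attributes declared in infrastructure code with the attributes observed on a live resource. The function name `detect_drift`, the `ABSENT` marker, and the sample attributes are hypothetical; a real workflow would pull the observed values from the cloud provider or from the IaC tool's own refresh step.

```python
from typing import Any

ABSENT = "<absent>"  # hypothetical marker for an attribute missing on one side


def detect_drift(declared: dict[str, Any], observed: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Illustrative data only: what the code declares vs. what is actually running.
declared = {"instance_type": "t3.medium", "encrypted": True, "tags": {"team": "payments"}}
observed = {
    "instance_type": "t3.large",
    "encrypted": True,
    "tags": {"team": "payments"},
    "public_ip": "203.0.113.10",
}

drift = detect_drift(declared, observed)
for attribute, (want, have) in drift.items():
    print(f"{attribute}: declared={want!r} observed={have!r}")
```

Whether a reported difference is reverted automatically or routed to a review is the response protocol the module asks participants to choose; the sketch only produces the evidence.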

Module 2: Cloud-Agnostic Provisioning Design

Mapping provider-specific services (e.g., AWS Lambda, Azure Functions) to abstracted interfaces in IaC templates for portability.
Designing modular Terraform components with provider aliases to support multi-cloud staging environments.
Managing state file locking and backend configuration in distributed teams using either an S3 backend with DynamoDB locking or Terraform Cloud.
Implementing conditional resource creation based on environment tags without introducing configuration sprawl.
Handling secrets injection during provisioning using external vault integration versus cloud-native secret managers.
Validating infrastructure plans through automated policy-as-code checks using Open Policy Agent or HashiCorp Sentinel.
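
The plan validation topic can be illustrated with a small check written in plain Python. This is a stand-in for a real policy engine such as Open Policy Agent or Sentinel, not an example of either tool's policy language. It assumes the JSON plan representation produced by `terraform show -json`, specifically the `resource_changes` list with `address`, `change.actions`, and `change.after`. The required tag set and the sample plan are hypothetical.

```python
import json

REQUIRED_TAGS = {"owner", "cost-center"}  # hypothetical organizational policy


def missing_tag_violations(plan: dict) -> list[str]:
    """List resources that are created or updated without all required tags."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # deletes and no-ops are out of scope for this check
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # assume the resource type does not support tags
        missing = REQUIRED_TAGS - set(after["tags"] or {})
        if missing:
            violations.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return violations


# Illustrative plan fragment; a pipeline would load the real file instead.
plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs",
     "change": {"actions": ["create"], "after": {"tags": {"owner": "platform"}}}},
    {"address": "aws_instance.web",
     "change": {"actions": ["create"],
                "after": {"tags": {"owner": "web", "cost-center": "cc-42"}}}},
    {"address": "aws_instance.old",
     "change": {"actions": ["delete"], "after": null}}
  ]
}
""")

violations = missing_tag_violations(plan)
for line in violations:
    print(line)
```

In a pipeline, a non-empty result would typically fail the job before any apply step runs.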

Module 3: Continuous Infrastructure Delivery

Configuring CI pipelines to perform plan generation and policy validation without applying changes in pull requests.
Orchestrating canary infrastructure rollouts for network or database tier changes using automated traffic shifting.
Integrating infrastructure tests using Terratest to verify resource attributes and connectivity post-deployment.
Managing dependency chains between infrastructure components across service boundaries.
Implementing automated rollback triggers based on CloudWatch, Prometheus, or custom health signals.
Securing pipeline access with short-lived credentials and just-in-time provisioning for production environments.
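
The automated rollback topic in this module comes down to a decision rule over health samples. The sketch below is one possible rule, assuming error-rate samples have already been fetched from CloudWatch, Prometheus, or a custom signal. The function name, the 5% threshold, and the three-sample streak are illustrative assumptions, not recommended values.

```python
from typing import Iterable


def should_roll_back(error_rates: Iterable[float], threshold: float = 0.05, consecutive: int = 3) -> bool:
    """Return True once `consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for rate in error_rates:
        # A healthy sample resets the streak, so isolated spikes are ignored.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Isolated spikes: no rollback.
print(should_roll_back([0.01, 0.09, 0.02, 0.08, 0.01]))
# Three breaches in a row: rollback.
print(should_roll_back([0.01, 0.06, 0.07, 0.09]))
```

Requiring a streak rather than a single breach trades detection speed for fewer false rollbacks; the right balance depends on how costly each outcome is for the tier being changed.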

Module 4: Observability Integration for Infrastructure

Instrumenting infrastructure components with structured logging and metric exporters for centralized collection.
Correlating infrastructure events (e.g., autoscaling actions) with application performance metrics in dashboards.
Designing alert thresholds for resource exhaustion (e.g., IP space, disk IOPS) that account for burst patterns.
Implementing synthetic transaction monitoring to validate infrastructure-level connectivity and DNS resolution.
Managing log retention policies across environments to balance cost, compliance, and debugging utility.
Enabling distributed tracing for infrastructure-mediated calls (e.g., API gateways, service meshes).
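
For the structured logging topic, the following sketch emits one JSON object per log line using only the Python standard library. The `JsonFormatter` class, the `context` field convention, and the logger name are assumptions made for this example; teams often use an existing JSON logging library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


logger = logging.getLogger("infra.autoscaler")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Key-value context makes the event easy to correlate in a central store.
logger.info("scale-out", extra={"context": {"group": "web-asg", "desired": 6}})
```

Because every field is a named key, a collector can filter on `group` or `desired` without parsing free-form message text.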

Module 5: Security and Compliance Automation

Embedding CIS benchmark checks into CI/CD pipelines using tools like Checkov or Terrascan.
Automating certificate rotation for load balancers and internal services using private CAs and scheduled jobs.
Enforcing network segmentation by analyzing VPC flow logs automatically and updating network policies in response.
Managing IAM role inheritance and least privilege in multi-account cloud environments with service control policies.
Implementing just-enough access for infrastructure operators using temporary role assumption.
Generating compliance evidence packages from infrastructure state and audit logs for external review cycles.
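
The least-privilege topic can be illustrated with a simple lint over an IAM policy document. The sketch assumes the standard AWS policy JSON shape (`Statement`, `Effect`, `Action`) and flags `Allow` statements whose actions are `*` or a service-wide wildcard such as `s3:*`. It deliberately ignores `NotAction`, conditions, and resource scoping, so it is a first-pass filter rather than a full analysis. The function name and sample policy are hypothetical.

```python
def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant '*' or '<service>:*' actions."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            flagged.append(stmt)
    return flagged


# Illustrative policy document.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "AdminEverything", "Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Sid": "ReadLogs", "Effect": "Allow", "Action": ["logs:GetLogEvents"], "Resource": "*"},
        {"Sid": "AllS3", "Effect": "Allow", "Action": ["s3:*"], "Resource": "*"},
        {"Sid": "DenyIam", "Effect": "Deny", "Action": "iam:*", "Resource": "*"},
    ],
}

flagged_sids = [stmt["Sid"] for stmt in wildcard_statements(policy)]
print(flagged_sids)
```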

Module 6: Scalability and Resilience Engineering

Designing autoscaling groups with predictive and reactive scaling policies based on historical load patterns.
Implementing multi-AZ and multi-region failover for stateful services with data replication SLAs.
Testing infrastructure resilience using controlled chaos engineering experiments in staging environments.
Managing stateful workloads (e.g., databases) with automated backup, restore, and point-in-time recovery workflows.
Optimizing cold start behavior for serverless infrastructure through provisioned concurrency and pre-warming.
Validating disaster recovery runbooks with automated execution simulations and RTO/RPO tracking.
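
RTO/RPO tracking for a disaster recovery drill reduces to simple timestamp arithmetic: the achieved RPO is the gap between the last good backup and the start of the outage, and the achieved RTO is the gap between the start of the outage and service restoration. The sketch below shows that calculation; the function name, field names, targets, and timestamps are all illustrative.

```python
from datetime import datetime, timedelta


def evaluate_drill(
    last_backup: datetime,
    outage_start: datetime,
    service_restored: datetime,
    rto_target: timedelta,
    rpo_target: timedelta,
) -> dict:
    """Compare achieved recovery time and data-loss window against targets."""
    rpo = outage_start - last_backup       # data written in this window is lost
    rto = service_restored - outage_start  # how long the service was down
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_met": rto <= rto_target,
        "rpo_met": rpo <= rpo_target,
    }


# Illustrative drill: 50 minutes of downtime, 15 minutes of potential data loss.
result = evaluate_drill(
    last_backup=datetime(2024, 1, 10, 11, 45),
    outage_start=datetime(2024, 1, 10, 12, 0),
    service_restored=datetime(2024, 1, 10, 12, 50),
    rto_target=timedelta(hours=1),
    rpo_target=timedelta(minutes=10),
)
print(result)
```

In this example the drill meets the one-hour RTO target but misses the ten-minute RPO target, which would point at backup frequency rather than at the restore procedure.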

Module 7: Cost Governance and Optimization

Allocating cloud spend to business units using granular tagging and cost allocation reports.
Automating resource scheduling for non-production environments using per-team start/stop policies.
Right-sizing compute instances based on utilization telemetry and performance baselines.
Implementing commitment tracking for reserved instances and savings plans across hybrid environments.
Flagging underutilized resources (e.g., idle load balancers, unattached disks) with automated remediation workflows.
Integrating cost impact analysis into pull requests using tools like Infracost or CloudHealth APIs.
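
Tag-based cost allocation, the first topic in this module, can be sketched as a grouped sum over billing line items. The line-item shape, the `cost-center` tag key, and the `untagged` bucket name are assumptions for this example; real billing exports have provider-specific formats.

```python
from collections import defaultdict
from decimal import Decimal


def allocate_costs(line_items: list[dict], tag_key: str = "cost-center") -> dict[str, Decimal]:
    """Sum cost per value of `tag_key`; items without the tag go to 'untagged'."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Illustrative line items; Decimal avoids floating-point rounding on money.
line_items = [
    {"resource": "vm-a", "cost": Decimal("120.00"), "tags": {"cost-center": "cc-100"}},
    {"resource": "db-b", "cost": Decimal("30.50"), "tags": {"cost-center": "cc-100"}},
    {"resource": "vm-c", "cost": Decimal("75.25"), "tags": {"cost-center": "cc-200"}},
    {"resource": "disk-d", "cost": Decimal("12.00"), "tags": {}},
]

totals = allocate_costs(line_items)
for owner, amount in sorted(totals.items()):
    print(f"{owner}: {amount}")
```

The size of the `untagged` bucket is itself a useful governance metric, since it shows how much spend cannot yet be attributed to a business unit.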

Module 8: Platform Team Operations and Feedback Loops

Managing self-service infrastructure portals with guardrails while accommodating edge use cases.
Collecting and prioritizing infrastructure feature requests from development teams using issue triage workflows.
Operating internal SLAs for infrastructure provisioning and incident response with public status dashboards.
Conducting blameless postmortems for infrastructure outages with action item tracking.
Rotating platform engineers through on-call duties to maintain operational empathy and system familiarity.
Measuring platform adoption and usability through deployment frequency and mean time to provision metrics.
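
The mean time to provision metric mentioned above can be computed from request and fulfillment timestamps. The sketch below assumes each request is a `(requested_at, fulfilled_at)` pair, with `None` for requests still in progress; the function name and sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import Optional


def mean_time_to_provision(requests: list[tuple[datetime, Optional[datetime]]]) -> Optional[float]:
    """Return the mean provisioning time in minutes, ignoring open requests."""
    durations = [
        (fulfilled - requested).total_seconds()
        for requested, fulfilled in requests
        if fulfilled is not None
    ]
    return mean(durations) / 60 if durations else None


# Illustrative requests: 12 minutes, 30 minutes, and one still pending.
requests = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 12)),
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 1, 11, 0), None),
]

mttp_minutes = mean_time_to_provision(requests)
print(mttp_minutes)
```

Excluding open requests keeps the average well defined, but it can hide a backlog, so the count of pending requests is worth reporting alongside it.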