This curriculum covers the technical and organisational scope of a multi-workshop infrastructure modernisation programme. It works through the design decisions and trade-offs that arise in large-scale cloud adoption and internal platform engineering initiatives.
Module 1: Defining Application Requirements and Constraints
- Selecting between monolithic and microservices architecture based on team size, deployment frequency, and domain complexity.
- Negotiating latency SLAs with business stakeholders to inform infrastructure tiering and data center placement.
- Evaluating data residency laws when choosing cloud regions for application deployment.
- Documenting non-functional requirements such as throughput, concurrency, and failover behavior for infrastructure validation.
- Assessing third-party API dependencies and their uptime guarantees when designing retry and fallback mechanisms.
- Aligning infrastructure choices with application lifecycle phases, including development, staging, and production parity.
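The retry and fallback design mentioned above can be sketched as follows. This is a minimal illustration, not a production client: the function name `call_with_retry`, its parameters, and the doubling backoff schedule are assumptions chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` up to `attempts` times, then return `fallback()`.

    All names here are hypothetical; `sleep` is injectable so the
    backoff can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Back off before the next attempt, doubling the delay each
            # time. There is no wait after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: degrade to the fallback (e.g. cached data).
    return fallback()
```

In a real service the `except` clause would normally be narrowed to the transient errors of the specific dependency, and the total retry time should fit inside the latency SLA agreed for the calling endpoint.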
Module 2: Cloud Platform Selection and Account Strategy
- Implementing a multi-account structure with AWS Organizations, or separate Azure subscriptions grouped under management groups, to isolate environments and limit blast radius.
- Choosing between public cloud providers based on available managed services, egress costs, and support response times.
- Designing cross-cloud identity federation for hybrid access without duplicating user management.
- Allocating budget ownership across departments using cloud cost allocation tags and chargeback models.
- Establishing network peering or transit gateway architectures for inter-account communication.
- Enforcing baseline security controls via Service Control Policies (SCPs) or Azure Policy across all accounts and subscriptions.
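As a rough illustration of tag-based chargeback, the sketch below sums billing line items per cost-allocation tag. The record shape, the `cost-center` tag key, and the `unallocated` bucket are assumptions made for the example; real billing exports have their own schemas.

```python
from collections import defaultdict
from decimal import Decimal


def chargeback(line_items, tag_key="cost-center"):
    """Sum cost per value of `tag_key`.

    Spend without the tag is grouped under 'unallocated' so that
    tagging gaps stay visible instead of silently disappearing.
    """
    totals = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "unallocated")
        # Costs are passed as strings and parsed as Decimal to avoid
        # floating-point rounding in financial totals.
        totals[owner] += Decimal(item["cost"])
    return dict(totals)


# Hypothetical line items, for illustration only.
items = [
    {"service": "compute", "cost": "120.50", "tags": {"cost-center": "payments"}},
    {"service": "storage", "cost": "30.25", "tags": {"cost-center": "payments"}},
    {"service": "database", "cost": "75.00", "tags": {"cost-center": "search"}},
    {"service": "egress", "cost": "12.00", "tags": {}},
]
report = chargeback(items)
```

Tracking the size of the `unallocated` bucket over time is one simple way to measure how well a tagging policy is being followed.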
Module 3: Networking and Connectivity Architecture
- Configuring VPC peering versus transit gateways based on scalability and routing complexity requirements.
- Implementing DNS failover and latency-based routing in multi-region application deployments.
- Deciding between public and private subnets for backend services based on attack surface exposure.
- Integrating on-premises data centers with cloud VPCs using IPsec VPNs or AWS Direct Connect.
- Setting up service endpoints or private links to restrict data access to authorized VPCs only.
- Managing CIDR block allocation across environments to prevent overlap during mergers or migrations.
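CIDR planning of the kind listed above can be checked with Python's standard `ipaddress` module. The sketch below carves equal-sized blocks out of a supernet and reports overlapping pairs; the function names, environment names, and address ranges are illustrative assumptions.

```python
import ipaddress
from itertools import combinations, islice


def allocate_blocks(supernet, new_prefix, names):
    """Assign one non-overlapping block of size /new_prefix per name."""
    network = ipaddress.ip_network(supernet)
    blocks = list(islice(network.subnets(new_prefix=new_prefix), len(names)))
    if len(blocks) < len(names):
        raise ValueError("supernet is too small for the requested blocks")
    return dict(zip(names, blocks))


def find_overlaps(blocks):
    """Return every pair of names whose networks overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(blocks.items(), 2)
        if net_a.overlaps(net_b)
    ]


plan = allocate_blocks("10.0.0.0/16", 20, ["dev", "staging", "prod"])

# Simulate a network inherited through a merger that was planned separately.
plan["acquired"] = ipaddress.ip_network("10.0.24.0/21")
conflicts = find_overlaps(plan)  # pairs that must be renumbered before peering
```

Here the inherited `10.0.24.0/21` range falls inside the `staging` block, which is the kind of conflict that blocks peering until one side is renumbered.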
Module 4: Compute and Container Orchestration Strategy
- Selecting EC2 instance types based on memory, CPU, and burst requirements for stateful versus stateless workloads.
- Configuring auto-scaling policies, comparing custom CloudWatch metrics against request-count-based scaling.
- Choosing between Kubernetes (EKS/GKE/AKS) and serverless (Fargate/Lambda) based on operational overhead tolerance.
- Implementing pod disruption budgets and node taints to control rolling update impact on availability.
- Managing container image lifecycle with vulnerability scanning and immutable tagging in private registries.
- Designing sidecar patterns for logging and monitoring without coupling them to application logic.
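To make the scaling discussion concrete, the sketch below computes a desired replica count from a metric and its target, using the ratio rule documented for the Kubernetes Horizontal Pod Autoscaler. The function name, parameters, and default bounds are assumptions; real autoscalers also apply tolerances and cooldown periods that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale replicas in proportion to metric / target, then clamp.

    The metric can be anything averaged per replica, such as requests
    per second or a custom queue-depth metric.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike or a bad reading cannot scale without bound.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas averaging 90 requests per second against a target of 60 gives `desired_replicas(4, 90, 60)`, which scales out to six.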
Module 5: Data Storage and Persistence Design
- Selecting a database service (Amazon RDS, DynamoDB, Azure Cosmos DB) based on query patterns and consistency requirements.
- Implementing read replicas and connection pooling to handle high-concurrency reporting workloads.
- Designing backup retention policies and cross-region replication for RPO and RTO compliance.
- Partitioning large datasets using time-based or tenant-based sharding to maintain query performance.
- Choosing between synchronous and asynchronous replication for multi-region data consistency.
- Encrypting data at rest using customer-managed KMS keys with defined rotation policies.
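The two sharding approaches named above, tenant-based and time-based, can be sketched as simple routing functions. The function names and the `table_YYYY_MM` naming scheme are assumptions for illustration, not a convention of any particular database.

```python
import hashlib
from datetime import date


def tenant_shard(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the latter
    is randomised per process for strings and would not route
    consistently across services.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def monthly_partition(table: str, day: date) -> str:
    """Name the time-based partition that holds rows for `day`."""
    return f"{table}_{day.year:04d}_{day.month:02d}"
```

Plain modulo routing moves most tenants when `shard_count` changes, so a design that expects to add shards often would need a lookup table or consistent hashing instead.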
Module 6: Security, Identity, and Access Governance
- Implementing least-privilege IAM roles with scoped-down policies for service accounts.
- Enforcing MFA for privileged access and conditional access policies in hybrid environments.
- Integrating secrets management (HashiCorp Vault, AWS Secrets Manager) into CI/CD pipelines.
- Configuring audit logging for API calls and data access with centralized SIEM ingestion.
- Managing cross-account role assumptions using trust policies with external ID validation.
- Rotating long-lived credentials and service account keys on a defined schedule with automation.
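A cross-account trust policy with external ID validation can be sketched as below. The JSON structure follows the standard IAM policy format; the helper name, the account ID, and the external ID value are placeholders invented for the example.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build a role trust policy that requires a matching external ID.

    Only principals from `trusted_account_id` may assume the role, and
    only when they pass the agreed `external_id` in the AssumeRole call.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values, for illustration only.
policy = build_trust_policy("111122223333", "example-external-id")
policy_json = json.dumps(policy, indent=2)
```

The external ID is not a secret in the cryptographic sense. Its purpose is to stop a third party from being tricked into assuming the role on behalf of the wrong customer.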
Module 7: Observability and Operational Resilience
- Instrumenting applications with structured logging to enable efficient log aggregation and querying.
- Defining SLOs and error budgets to guide incident response and feature release decisions.
- Configuring synthetic monitoring to detect degradation before user impact occurs.
- Setting up alerting thresholds that minimize false positives while capturing critical failures.
- Implementing distributed tracing to diagnose latency across microservices and third-party calls.
- Conducting chaos engineering experiments to validate failover and recovery procedures.
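The relationship between an SLO and its error budget is simple arithmetic, sketched below for a request-based SLO. The function name and the returned field names are assumptions for the example.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how much of a request-based error budget remains.

    `slo` is the target success ratio, e.g. 0.999 for "three nines".
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    # The budget is the share of requests that are allowed to fail.
    allowed = (1 - slo) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "budget_remaining_pct": 100 * remaining / allowed,
        "exhausted": remaining < 0,
    }
```

With a 99.9% SLO over one million requests, roughly 1,000 failures are allowed, so 250 observed failures leave about three quarters of the budget. A team might use the `exhausted` flag as the trigger for pausing feature releases in favour of reliability work.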
Module 8: CI/CD Pipeline and Infrastructure as Code
- Authoring Terraform modules with versioned inputs to promote reuse across environments.
- Managing state files in remote backends with locking and access controls to prevent conflicts.
- Implementing pipeline approvals and automated policy checks (e.g., Open Policy Agent) before production deployment.
- Designing blue-green or canary deployments with traffic shifting via load balancer or service mesh.
- Validating infrastructure changes using automated testing in pre-production environments.
- Handling configuration drift by enforcing declarative state and scheduling drift detection scans.
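Drift detection amounts to comparing declared state with observed state. The sketch below shows that comparison on plain dictionaries; it is an illustration of the idea only, with hypothetical resource names, and in practice a tool such as `terraform plan` performs the comparison against the provider APIs.

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Compare declared resources with what is actually deployed.

    Both arguments map a resource name to its attribute dictionary.
    """
    # Declared in code but not found in the environment.
    missing = sorted(declared.keys() - actual.keys())
    # Present in the environment but not managed by code.
    unmanaged = sorted(actual.keys() - declared.keys())
    # Present in both, but with differing attributes.
    changed = {}
    for name in sorted(declared.keys() & actual.keys()):
        if declared[name] != actual[name]:
            changed[name] = {"declared": declared[name], "actual": actual[name]}
    return {"missing": missing, "unmanaged": unmanaged, "changed": changed}


# Hypothetical resources, for illustration only.
declared = {
    "web_sg": {"ingress_ports": [443]},
    "app_bucket": {"versioning": True},
}
actual = {
    "web_sg": {"ingress_ports": [443, 22]},  # port opened by hand
    "debug_vm": {"size": "small"},           # created outside the pipeline
}
drift = detect_drift(declared, actual)
```

A scheduled scan would run a comparison like this on a timer and raise an alert, or open a change request, whenever any of the three result fields is non-empty.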