Description

This curriculum spans the technical and organisational complexity of a multi-workshop infrastructure modernisation programme, addressing the same design decisions and trade-offs encountered in large-scale cloud adoption and internal platform engineering initiatives.

Module 1: Defining Application Requirements and Constraints

Selecting between monolithic and microservices architecture based on team size, deployment frequency, and domain complexity.
Negotiating latency SLAs with business stakeholders to inform infrastructure tiering and data center placement.
Evaluating data residency laws when choosing cloud regions for application deployment.
Documenting non-functional requirements such as throughput, concurrency, and failover behavior for infrastructure validation.
Assessing third-party API dependencies and their uptime guarantees when designing retry and fallback mechanisms.
Aligning infrastructure choices with application lifecycle phases, including development, staging, and production parity.
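
The retry and fallback item above can be illustrated with a minimal sketch. The function name `call_with_retry`, its parameters, and the delay values are hypothetical choices for illustration, not part of the curriculum; a real design would derive the attempt count and delays from the third-party API's published uptime guarantee and rate limits.

```python
import time


def call_with_retry(primary, fallback, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `primary`, retrying with exponential backoff, then use `fallback`.

    `sleep` is injectable so the behaviour can be tested without waiting.
    Catching Exception is deliberately broad for this sketch; production code
    would catch only the errors known to be transient.
    """
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception:
            # Back off before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of raising.
    return fallback()
```

Returning a fallback value (for example, a cached response) rather than raising keeps the caller's failure mode explicit, which is the behaviour the non-functional requirements should document.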

Module 2: Cloud Platform Selection and Account Strategy

Implementing a multi-account structure with AWS Organizations, or separate Azure subscriptions grouped under management groups, to isolate environments and limit blast radius.
Choosing between public cloud providers based on available managed services, egress costs, and support response times.
Designing cross-cloud identity federation for hybrid access without duplicating user management.
Allocating budget ownership across departments using cloud cost allocation tags and chargeback models.
Establishing network peering or transit gateway architectures for inter-account communication.
Enforcing baseline security controls via SCPs (Service Control Policies) or Azure Policy across all accounts.
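
As a rough illustration of the chargeback item above, the sketch below groups billing line items by a cost allocation tag. The record layout (`cost`, `tags`) and the `CostCenter` tag key are assumptions made for the example; actual billing exports differ by provider.

```python
from collections import defaultdict


def chargeback(line_items, tag_key="CostCenter", untagged="UNALLOCATED"):
    """Sum cost per owning department, based on one cost allocation tag.

    Items missing the tag are grouped under `untagged` so that unowned
    spend stays visible instead of being silently dropped.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, untagged)
        totals[owner] += item["cost"]
    return dict(totals)
```

Keeping an explicit bucket for untagged spend gives a simple measure of how well the tagging policy is being followed.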

Module 3: Networking and Connectivity Architecture

Configuring VPC peering versus transit gateways based on scalability and routing complexity requirements.
Implementing DNS failover and latency-based routing in multi-region application deployments.
Deciding between public and private subnets for backend services based on attack surface exposure.
Integrating on-premises data centers with cloud VPCs using IPsec VPNs or AWS Direct Connect.
Setting up service endpoints or private endpoints (e.g., AWS PrivateLink, Azure Private Link) to restrict data access to authorized VPCs or virtual networks only.
Managing CIDR block allocation across environments to prevent overlap during mergers or migrations.
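
The CIDR allocation item above lends itself to an automated check. The following sketch uses Python's standard `ipaddress` module to report overlapping ranges; the function name and the environment names in the usage note are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(allocations):
    """Return pairs of environment names whose CIDR blocks overlap.

    `allocations` maps an environment name to a CIDR string.
    """
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in allocations.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]
```

For example, `find_overlaps({"dev": "10.0.0.0/16", "acquired": "10.0.128.0/17"})` reports the pair `("dev", "acquired")`, the kind of conflict that typically surfaces when an acquired company's network is connected.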

Module 4: Compute and Container Orchestration Strategy

Selecting EC2 instance types based on memory, CPU, and burst requirements for stateful versus stateless workloads.
Configuring auto-scaling policies driven by custom CloudWatch metrics versus built-in request-count metrics.
Choosing between Kubernetes (EKS/GKE/AKS) and serverless (Fargate/Lambda) based on operational overhead tolerance.
Implementing pod disruption budgets and node taints to control rolling update impact on availability.
Managing container image lifecycle with vulnerability scanning and immutable tagging in private registries.
Designing sidecar patterns for logging and monitoring without coupling application logic.
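
To make the auto-scaling item above concrete, here is a minimal sketch of a proportional scaling rule, the same shape of calculation the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × metric / target)). Whether the metric is custom or request-based only changes the inputs. The function name, parameters, and bounds are hypothetical, and this is not a description of any provider's exact algorithm.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size=1, max_size=20):
    """Scale capacity in proportion to how far the metric is from its target.

    The result is clamped to [min_size, max_size] so a metric spike or a
    quiet period cannot push the group outside agreed limits.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, raw))
```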

Module 5: Data Storage and Persistence Design

Selecting a database service (RDS, DynamoDB, Cosmos DB) based on query patterns and consistency requirements.
Implementing read replicas and connection pooling to handle high-concurrency reporting workloads.
Designing backup retention policies and cross-region replication for RPO and RTO compliance.
Partitioning large datasets using time-based or tenant-based sharding to maintain query performance.
Choosing between synchronous and asynchronous replication for multi-region data consistency.
Encrypting data at rest using customer-managed KMS keys with defined rotation policies.
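
The sharding item above can be sketched as two small helpers: one maps a tenant to a shard with a stable hash, the other derives a monthly partition name. Both function names and the naming scheme are assumptions for illustration. Python's built-in `hash()` is avoided because it is randomised per process for strings.

```python
import hashlib
from datetime import date


def tenant_shard(tenant_id, shard_count):
    """Map a tenant to a shard index that is stable across processes."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


def monthly_partition(table, day):
    """Name the time-based partition that holds rows for `day`."""
    return f"{table}_{day:%Y_%m}"
```

Note that simple modulo placement moves most tenants when `shard_count` changes; a design that expects to add shards would likely need a lookup table or consistent hashing instead.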

Module 6: Security, Identity, and Access Governance

Implementing least-privilege IAM roles with scoped-down policies for service accounts.
Enforcing MFA for privileged access and conditional access policies in hybrid environments.
Integrating secrets management (HashiCorp Vault, AWS Secrets Manager) into CI/CD pipelines.
Configuring audit logging for API calls and data access with centralized SIEM ingestion.
Managing cross-account role assumptions using trust policies with external ID validation.
Rotating long-lived credentials and service account keys on a defined schedule with automation.
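
For the cross-account item above, the sketch below builds an AWS IAM trust policy document that allows `sts:AssumeRole` only when the caller supplies the agreed external ID. The helper name and the example values in the usage note are hypothetical; the policy keys follow the standard IAM policy grammar.

```python
import json


def cross_account_trust_policy(trusted_account_id, external_id):
    """Build a role trust policy that requires a matching sts:ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Reject the call unless the caller passes the agreed external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def to_json(policy):
    """Serialise the policy for use in an IaC template or API call."""
    return json.dumps(policy, indent=2)
```

Calling `to_json(cross_account_trust_policy("111122223333", "example-external-id"))` yields a document that could be attached as the role's trust policy; the account ID and external ID here are placeholders.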

Module 7: Observability and Operational Resilience

Instrumenting applications with structured logging to enable efficient log aggregation and querying.
Defining SLOs and error budgets to guide incident response and feature release decisions.
Configuring synthetic monitoring to detect degradation before user impact occurs.
Setting up alerting thresholds that minimize false positives while capturing critical failures.
Implementing distributed tracing to diagnose latency across microservices and third-party calls.
Conducting chaos engineering experiments to validate failover and recovery procedures.
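
The SLO and error budget item above reduces to simple arithmetic, sketched here for a request-based SLO. The function name, the returned field names, and the rule "freeze releases once the budget is spent" are assumptions for illustration; teams set their own policy thresholds.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise error budget usage for a request-based SLO.

    `slo_target` is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return {
        "allowed_failures": allowed,
        "consumed_fraction": consumed,
        # Hypothetical policy: stop feature releases when the budget is spent.
        "release_freeze": consumed >= 1.0,
    }
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget and releases continue.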

Module 8: CI/CD Pipeline and Infrastructure as Code

Authoring Terraform modules with versioned inputs to promote reuse across environments.
Managing state files in remote backends with locking and access controls to prevent conflicts.
Implementing pipeline approvals and automated policy checks (e.g., Open Policy Agent) before production deployment.
Designing blue-green or canary deployments with traffic shifting via load balancer or service mesh.
Validating infrastructure changes using automated testing in pre-production environments.
Handling configuration drift by enforcing declarative state and scheduling drift detection scans.
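
The drift detection item above can be sketched as a comparison between declared attributes and what is observed on the live resource. This is a simplified stand-in for what tools such as `terraform plan` do, not their implementation; the function name, the `ABSENT` marker, and the attribute names in the usage note are hypothetical.

```python
ABSENT = "<absent>"  # marker for an attribute present on only one side


def detect_drift(declared, observed):
    """Return {attribute: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result means the resource matches its declaration. A scheduled scan would run this kind of comparison for each managed resource and raise an alert or open a change request for any non-empty result, for instance when `instance_type` was declared as `t3.medium` but observed as `t3.large`.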

Infrastructure Design in Application Development