This curriculum spans the technical and organizational complexity of a multi-workshop cloud performance optimization program, addressing the interplay of architecture, operations, and governance that characterizes large-scale cloud adoption projects.
Module 1: Performance Requirements Definition and Baseline Establishment
- Selecting transactional vs. batch performance benchmarks based on business SLAs for order processing systems.
- Instrumenting production workloads to capture latency, throughput, and error rate baselines before migration.
- Defining acceptable performance thresholds for peak vs. off-peak operations in multi-region applications.
- Mapping legacy system performance profiles to cloud-native service tiers (e.g., matching on-prem database IOPS to provisioned cloud storage).
- Documenting non-functional requirements for third-party audit and compliance validation during cloud transition.
- Establishing performance budget allocations per microservice to prevent resource contention in shared environments.
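The baseline-capture bullets above reduce to percentile math over observed latency samples. A minimal sketch using a nearest-rank percentile (the `p50`/`p95`/`p99` summary shape is illustrative, not any particular monitoring tool's API):

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile over a latency sample (p in 0..100)."""
    ordered = sorted(samples_ms)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def latency_baseline(samples_ms):
    """Summarize raw latency observations into a pre-migration baseline."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "count": len(samples_ms),
    }
```

Capturing this summary before migration gives the "performance parity" target that the pilot migrations in Module 3 are validated against.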
Module 2: Cloud Architecture Design for Performance Scalability
- Choosing between auto-scaling groups and container orchestration (e.g., Kubernetes HPA) based on application startup latency sensitivity.
- Implementing caching layers (e.g., Redis or ElastiCache) with cache-aside vs. read-through strategies for high-frequency data access.
- Designing asynchronous communication patterns using message queues (e.g., SQS, Pub/Sub) to decouple performance-critical components.
- Selecting regional vs. multi-regional deployment architectures based on data residency laws and user proximity requirements.
- Configuring CDN edge locations with cache invalidation policies aligned to content update frequency.
- Integrating database read replicas with connection pooling to reduce primary instance load during reporting workloads.
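The cache-aside strategy named above can be sketched with an in-memory dict standing in for Redis/ElastiCache; the `loader` callback, `ttl_s` parameter, and explicit `now` clock are illustrative choices for this sketch, not a specific client's API:

```python
class CacheAside:
    """Cache-aside: the application checks the cache first and loads on miss."""

    def __init__(self, loader, ttl_s=300):
        self.loader = loader   # fetches from the source of truth, e.g. the database
        self.ttl_s = ttl_s
        self._store = {}       # stand-in for Redis: {key: (value, expires_at)}

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]                           # cache hit
        value = self.loader(key)                      # cache miss: load and populate
        self._store[key] = (value, now + self.ttl_s)
        return value
```

The contrast with read-through is where the load logic lives: in read-through the cache layer itself owns the loader, while cache-aside keeps it in the application, as sketched here.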
Module 3: Migration Strategy and Performance Risk Mitigation
- Executing pilot migrations in non-production environments to validate performance parity with legacy systems.
- Choosing rehosting vs. refactoring based on application statefulness and real-time processing dependencies.
- Implementing blue-green deployment with weighted routing to monitor performance impact during cutover.
- Planning data migration batches to minimize network congestion and avoid peak business hours.
- Using database migration tools with throttling controls to prevent source system degradation during replication.
- Lowering DNS TTL values ahead of cutover to accelerate failover during migration rollback scenarios.
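One way to reason about the weighted-routing cutover above is as a deterministic ramp schedule. A minimal sketch; the five-step linear ramp is an assumption, and a real cutover would only advance a step after latency/error gates pass:

```python
def cutover_weights(step, steps=5):
    """Linear blue-to-green traffic ramp: (blue_pct, green_pct) at a given step."""
    step = max(0, min(step, steps))        # clamp to the ramp boundaries
    green = round(100 * step / steps)
    return 100 - green, green
```

Holding at a step (or stepping back to 0) is the rollback path when the green environment exceeds the thresholds defined in Module 1.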
Module 4: Monitoring, Observability, and Performance Diagnostics
- Deploying distributed tracing across microservices to identify latency bottlenecks in request chains.
- Configuring synthetic transaction monitoring to simulate user workflows across geographic regions.
- Setting up alert thresholds for CPU steal time and memory ballooning in virtualized cloud instances.
- Correlating application logs with infrastructure metrics to isolate root causes of performance degradation.
- Implementing custom metrics collection for business-critical operations not captured by default monitoring tools.
- Managing log retention policies to balance forensic analysis needs with storage cost constraints.
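The alert-threshold bullet above is, at its core, a comparison of observed metrics against configured limits. A sketch of that evaluation step (metric names and units here are hypothetical examples):

```python
def breaches(observed, thresholds):
    """Return, sorted, the metric names whose observed value exceeds its threshold."""
    return sorted(
        name
        for name, value in observed.items()
        if name in thresholds and value > thresholds[name]
    )
```

Real alerting pipelines add windowing and hysteresis on top of this check to avoid flapping, but the core correlation of a metric to its limit looks like this.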
Module 5: Resource Optimization and Cost-Performance Trade-offs
- Selecting reserved instances vs. spot instances based on application fault tolerance and uptime requirements.
- Right-sizing VMs using performance telemetry to eliminate over-provisioning without risking SLA breaches.
- Implementing vertical pod autoscaling in Kubernetes while managing application restart frequency.
- Evaluating cold start latency of serverless functions against always-on container workloads for real-time APIs.
- Using storage tiering (e.g., S3 Standard vs. Glacier) with lifecycle policies based on access patterns.
- Optimizing database indexing strategies to reduce query time while managing write performance overhead.
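Right-sizing from telemetry, as described above, amounts to picking the smallest tier that covers observed p95 usage plus a safety headroom. A sketch under stated assumptions; the tier catalog and the 1.3x headroom factor are hypothetical:

```python
# Hypothetical instance tiers as (name, vCPUs); real provider catalogs are larger.
TIERS = [("small", 2), ("medium", 4), ("large", 8), ("xlarge", 16)]

def right_size(p95_cpu_cores, headroom=1.3):
    """Smallest tier covering p95 CPU usage plus an SLA-protecting headroom."""
    needed = p95_cpu_cores * headroom
    for name, vcpus in TIERS:
        if vcpus >= needed:
            return name
    return TIERS[-1][0]   # telemetry exceeds the catalog: flag the largest tier
```

Sizing to p95-plus-headroom rather than peak is the trade-off named in the bullet: it eliminates over-provisioning while leaving margin against SLA breaches.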
Module 6: Security, Compliance, and Performance Interdependencies
- Assessing encryption overhead (e.g., TLS 1.3 vs. 1.2) on API response times in high-throughput systems.
- Configuring WAF rules to mitigate DDoS attacks without introducing unacceptable request processing latency.
- Implementing just-in-time access controls while maintaining session persistence for performance-sensitive applications.
- Integrating secrets management (e.g., HashiCorp Vault) with minimal impact on application startup time.
- Validating that audit logging mechanisms do not saturate disk I/O on transaction-heavy database instances.
- Negotiating compliance audit frequency to reduce performance impact of continuous monitoring scans.
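The WAF latency trade-off above can be made concrete with the classic token-bucket limiter (the per-client `rate_per_s`/`burst` values are illustrative): admission is an O(1) arithmetic check, so the mitigation itself adds negligible per-request latency.

```python
class TokenBucket:
    """Per-client rate limiter of the kind a WAF layer applies to absorb floods."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Legitimate bursts pass immediately; sustained floods are rejected without deep packet inspection on the hot path, which is what keeps request processing latency acceptable.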
Module 7: Operational Governance and Performance Lifecycle Management
- Establishing change advisory board (CAB) review criteria for performance-impacting infrastructure changes.
- Defining rollback procedures for deployments that exceed latency or error rate thresholds.
- Conducting quarterly performance regression testing after cloud platform updates or patches.
- Enforcing tagging standards for resources to enable accurate performance and cost attribution by team.
- Managing technical debt by scheduling refactoring windows for performance-critical legacy components.
- Integrating performance KPIs into incident response playbooks for faster resolution of outages.
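Enforcing the tagging standard above is mechanically simple; a minimal sketch, assuming a hypothetical required-tag set and an inventory already exported as plain dicts:

```python
REQUIRED_TAGS = {"team", "env", "cost-center"}   # hypothetical tagging standard

def tag_violations(resources):
    """Map each non-compliant resource id to its sorted missing tag keys."""
    violations = {}
    for resource_id, tags in resources.items():
        missing = REQUIRED_TAGS - tags.keys()
        if missing:
            violations[resource_id] = sorted(missing)
    return violations
```

Running this in a scheduled compliance job (or as a pre-provisioning policy gate) is what makes per-team performance and cost attribution trustworthy downstream.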
Module 8: Cross-Functional Collaboration and Stakeholder Alignment
- Facilitating joint performance testing sessions between development, operations, and business units.
- Translating technical performance metrics into business impact terms for executive reporting.
- Resolving conflicts between development velocity and production stability in CI/CD pipeline design.
- Coordinating capacity planning cycles with finance teams for budget-constrained scaling initiatives.
- Documenting service dependencies for incident management during cross-team outages.
- Aligning DR drill schedules with business operations to minimize disruption to performance-sensitive workflows.
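Translating technical metrics into business impact, as in the executive-reporting bullet above, often reduces to simple arithmetic. A sketch in which every input is a hypothetical figure the business side would supply:

```python
def revenue_at_risk_per_hour(orders_per_hour, error_rate, avg_order_value):
    """Convert an error rate into an hourly revenue exposure for exec reporting."""
    return orders_per_hour * error_rate * avg_order_value
```

Reporting "roughly $1,200/hour of order revenue at risk" lands with executives in a way that "2% error rate on the checkout API" does not.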