This curriculum matches the operational rigor of a multi-workshop program on engineering management. Its scope is equivalent to an internal capability build for resource governance across development, infrastructure, and compliance functions in a mid-to-large software organisation.
Module 1: Strategic Resource Allocation Across Development Lifecycles
- Determine the allocation of engineering headcount between greenfield development and technical debt remediation based on product roadmap urgency and system stability metrics.
- Adjust sprint capacity planning to account for unplanned production incidents, balancing feature delivery with operational load.
- Decide when to staff cross-functional teams versus specialized roles based on project complexity and delivery timelines.
- Implement capacity forecasting models using historical velocity and release burndown data to inform quarterly resourcing decisions.
- Negotiate resource sharing agreements between product teams during peak delivery periods, including fallback protocols for priority conflicts.
- Integrate product management and engineering leadership in quarterly resource planning sessions to align staffing with business outcomes.
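As a rough illustration of the capacity forecasting item above, the following sketch projects a quarter's deliverable story points from recent sprint velocity. The function name, the six-sprint quarter, and the incident reserve are all assumptions made for this example; teams whose recorded velocity already reflects incident work would set the reserve to 0.

```python
from statistics import mean, stdev


def forecast_capacity(velocities, incident_reserve=0.2, sprints_per_quarter=6):
    """Project story points for one quarter from historical sprint velocity.

    velocities: completed story points per recent sprint (hypothetical data).
    incident_reserve: fraction of capacity held back for unplanned
        production incidents (an assumed planning buffer).
    """
    if len(velocities) < 2:
        raise ValueError("need at least two sprints of history")
    if not 0 <= incident_reserve < 1:
        raise ValueError("incident_reserve must be in [0, 1)")

    avg = mean(velocities)
    spread = stdev(velocities)  # sample standard deviation
    usable = 1 - incident_reserve

    return {
        # Average velocity, reduced by the incident reserve.
        "expected": avg * usable * sprints_per_quarter,
        # One standard deviation below average, floored at zero.
        "conservative": max(avg - spread, 0) * usable * sprints_per_quarter,
    }


if __name__ == "__main__":
    print(forecast_capacity([30, 34, 32, 28, 36]))
```

Planning against the conservative figure and treating the expected figure as a stretch target is one possible convention, not a rule from this curriculum.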
Module 2: Infrastructure Provisioning and Environment Management
- Configure CI/CD pipelines to dynamically allocate ephemeral environments using Kubernetes namespaces with quota enforcement per team.
- Enforce naming conventions and tagging policies for cloud resources to enable cost attribution and lifecycle automation.
- Design environment promotion strategies that balance deployment speed with data isolation requirements for compliance.
- Implement automated teardown of non-production environments after 14 days of inactivity to control cloud spend.
- Configure VPC peering and security group rules to allow controlled access between staging and shared dependency environments.
- Establish quotas on compute instance types in development accounts to prevent accidental over-provisioning.
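The tagging and teardown items above can be sketched as a pair of checks that a scheduled job might run. The tag names, the `Environment` record, and the way the last activity timestamp is obtained are hypothetical; a real job would read them from the cloud provider's inventory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"team", "cost-center", "environment"}  # assumed tag policy
INACTIVITY_LIMIT = timedelta(days=14)  # from the teardown rule above


@dataclass
class Environment:
    name: str
    tags: dict
    last_activity: datetime


def missing_tags(env):
    """Return the required tags that this environment lacks."""
    return REQUIRED_TAGS - set(env.tags)


def due_for_teardown(envs, now):
    """Names of non-production environments idle for 14 days or more."""
    return [
        env.name
        for env in envs
        if env.tags.get("environment") != "production"
        and now - env.last_activity >= INACTIVITY_LIMIT
    ]


if __name__ == "__main__":
    now = datetime(2024, 6, 30, tzinfo=timezone.utc)
    envs = [
        Environment("pr-101", {"team": "pay", "environment": "dev"},
                    now - timedelta(days=20)),
        Environment("staging", {"team": "pay", "cost-center": "cc-1",
                                "environment": "staging"},
                    now - timedelta(days=2)),
    ]
    print(due_for_teardown(envs, now))
    print(missing_tags(envs[0]))
```

Environments without an `environment` tag are treated as non-production here, so untagged resources become teardown candidates; whether that default is acceptable is a policy decision.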
Module 3: Human Resource Scaling and Team Topology Design
- Select between feature teams and component teams based on domain coupling and release independence requirements.
- Redistribute backend developers across microservices based on incident volume and service-level objectives (SLOs).
- Introduce a platform engineering team to absorb undifferentiated heavy lifting, reducing cognitive load on product teams.
- Decide whether to onboard contractors for time-bound initiatives based on knowledge transfer risk and IP sensitivity.
- Implement team-level on-call rotations with escalation paths and fatigue management rules (e.g., a maximum of one incident per week for L1 responders).
- Conduct team health checks quarterly to assess burnout risk and adjust workload distribution accordingly.
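The fatigue management rule in the on-call item above could be checked with a small script like the one below. It is a sketch only: the input shape (pairs of engineer name and incident date) and the use of ISO calendar weeks are assumptions, and the limit of one incident per week is taken from the example in the list.

```python
from collections import Counter
from datetime import date

MAX_INCIDENTS_PER_WEEK = 1  # example limit from the rotation rule above


def fatigue_violations(pages, limit=MAX_INCIDENTS_PER_WEEK):
    """Find engineers who handled more incidents in a week than allowed.

    pages: iterable of (engineer, date) pairs, one per incident handled.
    Returns sorted (engineer, iso_year, iso_week) tuples over the limit.
    """
    counts = Counter()
    for engineer, day in pages:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(engineer, iso[0], iso[1])] += 1
    return sorted(key for key, n in counts.items() if n > limit)


if __name__ == "__main__":
    pages = [
        ("ana", date(2024, 3, 4)),
        ("ana", date(2024, 3, 6)),
        ("ben", date(2024, 3, 5)),
        ("ana", date(2024, 3, 12)),
    ]
    print(fatigue_violations(pages))
```

A violation found this way could feed the quarterly health check as one input to workload redistribution.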
Module 4: Budget Governance and Cost Accountability
- Assign cost centers to cloud projects and enforce budget alerts at 75%, 90%, and 100% thresholds.
- Require architecture review board (ARB) approval for any resource deployment exceeding $5,000/month in projected spend.
- Allocate cloud costs back to product lines using tag-based chargeback models in financial reporting systems.
- Compare reserved instance commitments against actual usage to decide which reservations to renew and where spend can be optimized.
- Implement policy-as-code checks in IaC pipelines to block non-approved instance types in production.
- Conduct monthly cloud spend reviews with engineering leads to identify anomalies and enforce accountability.
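The budget thresholds, ARB limit, and tag-based chargeback described above can be expressed as a few pure functions. This is a minimal sketch: the function names, the `product` tag key, and the line-item shape are assumptions, while the 75%, 90%, 100%, and $5,000/month figures come from the list.

```python
from collections import defaultdict

ALERT_THRESHOLDS = (0.75, 0.90, 1.00)  # budget alert levels from the list
ARB_LIMIT_USD = 5000  # projected monthly spend that requires ARB approval


def triggered_alerts(spend, budget):
    """Return every alert threshold that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in ALERT_THRESHOLDS if ratio >= t]


def needs_arb_approval(projected_monthly_usd):
    """True when projected spend exceeds the ARB limit."""
    return projected_monthly_usd > ARB_LIMIT_USD


def chargeback(line_items):
    """Sum cost per product line using a hypothetical 'product' tag.

    line_items: iterable of dicts with 'cost' and 'tags' keys.
    Untagged costs are collected under 'untagged' for follow-up.
    """
    totals = defaultdict(float)
    for item in line_items:
        product = item.get("tags", {}).get("product", "untagged")
        totals[product] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    print(triggered_alerts(9200, 10000))
    print(needs_arb_approval(6200))
    print(chargeback([
        {"cost": 120.0, "tags": {"product": "checkout"}},
        {"cost": 80.0, "tags": {"product": "search"}},
        {"cost": 40.0, "tags": {}},
    ]))
```

Keeping an explicit `untagged` bucket makes gaps in the tagging policy visible during the monthly spend review.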
Module 5: Toolchain Standardization and Developer Enablement
- Mandate the use of a centralized template repository for Terraform modules to ensure compliance with security baselines.
- Configure IDE settings and linter rules via organization-wide configurations to reduce configuration drift.
- Deploy internal developer portals with service catalogs to reduce onboarding time for new team members.
- Standardize logging formats and metric exports to enable consistent monitoring across services.
- Restrict use of third-party SaaS tools by requiring security and legal review before procurement.
- Automate dependency scanning in CI to block builds that contain known vulnerabilities with a CVSS score above 7.0.
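The dependency scanning gate above might look like the following in a CI step. It is a sketch under assumptions: the findings format (a list of dicts with `id`, `package`, and `cvss` keys) is hypothetical and does not correspond to any particular scanner's output, and the identifiers are placeholders.

```python
CVSS_BLOCK_THRESHOLD = 7.0  # from the scanning rule above


def blocking_findings(findings, threshold=CVSS_BLOCK_THRESHOLD):
    """Return findings whose CVSS score is strictly above the threshold."""
    return [f for f in findings if f["cvss"] > threshold]


def gate_exit_code(findings, threshold=CVSS_BLOCK_THRESHOLD):
    """Return 1 (fail the build) if any finding blocks, otherwise 0."""
    blockers = blocking_findings(findings, threshold)
    for f in blockers:
        print(f"BLOCKED: {f['id']} in {f['package']} (CVSS {f['cvss']})")
    return 1 if blockers else 0


if __name__ == "__main__":
    # Placeholder findings; a real pipeline would parse scanner output here.
    sample = [
        {"id": "EXAMPLE-0001", "package": "libfoo", "cvss": 9.8},
        {"id": "EXAMPLE-0002", "package": "libbar", "cvss": 5.3},
    ]
    raise SystemExit(gate_exit_code(sample))
```

A score of exactly 7.0 passes in this sketch because the rule says "above"; a stricter policy would use `>=`.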
Module 6: Cross-Functional Dependency Management
- Map inter-service dependencies using distributed tracing data to identify high-risk integration points.
- Establish service-level agreements (SLAs) between internal teams for API uptime and response latency.
- Coordinate release calendars across teams to avoid overlapping major deployments during critical business periods.
- Implement consumer-driven contract testing for shared APIs to prevent breaking changes in shared interfaces.
- Design fallback mechanisms for downstream service outages, including circuit breakers and cached responses.
- Facilitate dependency triage meetings when multiple teams require changes to a shared library or platform service.
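To make the fallback item above concrete, here is a minimal circuit breaker that serves the last cached response while a downstream service is failing. The class, its parameters, and the default thresholds are illustrative assumptions; production systems usually rely on an established resilience library rather than a hand-rolled breaker.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency and serves cached responses."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed
        self.cache = {}

    def call(self, key, fetch):
        """Run fetch() unless the circuit is open; fall back to the cache."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return self._fallback(key)  # open: skip the downstream call
            self.opened_at = None           # half-open: allow one trial call
        try:
            value = fetch()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self._fallback(key)
        self.failures = 0                   # success closes the circuit
        self.cache[key] = value
        return value

    def _fallback(self, key):
        if key in self.cache:
            return self.cache[key]
        raise RuntimeError("downstream unavailable and no cached response")
```

Serving stale data is only appropriate where the consumer can tolerate it, so the cache lifetime is a design decision to agree with the owning team.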
Module 7: Performance Monitoring and Resource Optimization
- Set autoscaling policies based on observed CPU, memory, and request queue length metrics during peak load.
- Right-size database instances by analyzing query performance and connection pool utilization over 30-day periods.
- Configure APM tools to sample high-latency transactions and correlate them with code deployments.
- Identify underutilized services running at less than 20% CPU over 7 days for consolidation or decommissioning.
- Implement caching strategies at API and database layers based on read/write ratio and data volatility.
- Use flame graphs to pinpoint inefficient code paths and prioritize refactoring efforts by performance impact.
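The underutilization check above can be sketched as a filter over daily CPU averages. The input shape (service name mapped to a list of daily average CPU percentages, most recent last) is an assumption, as is the interpretation that every one of the last 7 days must be below 20%; averaging over the window would be an equally valid reading.

```python
def underutilized(cpu_by_service, threshold=20.0, window_days=7):
    """Return services whose daily CPU stayed below threshold all window.

    cpu_by_service: dict mapping service name to a list of daily average
        CPU percentages, ordered oldest to newest (hypothetical input).
    Services with fewer than window_days of data are skipped.
    """
    candidates = []
    for name, daily in cpu_by_service.items():
        window = daily[-window_days:]
        if len(window) == window_days and max(window) < threshold:
            candidates.append(name)
    return sorted(candidates)


if __name__ == "__main__":
    metrics = {
        "reports": [12, 9, 15, 11, 8, 14, 10],
        "checkout": [55, 61, 48, 70, 66, 59, 62],
        "legacy-export": [3, 2, 4, 25, 3, 2, 3],  # one busy day disqualifies it
        "new-service": [5, 6],                    # not enough history yet
    }
    print(underutilized(metrics))
```

The output is a list of candidates for review, not an automatic decommissioning decision, since batch services can be idle most days and still be required.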
Module 8: Compliance, Audit, and Operational Resilience
- Enforce immutable audit logs for all production deployments using write-once storage and role-based access.
- Conduct quarterly access reviews to revoke unnecessary permissions for decommissioned projects and departed staff.
- Design disaster recovery runbooks with defined RTO and RPO targets, tested via annual fire drills.
- Implement data residency controls by restricting deployment regions in CI/CD pipelines based on customer location.
- Document incident response roles and communication protocols for major outages affecting multiple services.
- Archive deployment artifacts and logs for 7 years to comply with financial regulatory requirements.
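Two of the controls above, data residency and the 7-year retention period, lend themselves to simple pipeline checks. In this sketch the location-to-region mapping and both function names are hypothetical; the actual mapping would come from legal and compliance requirements.

```python
from datetime import date

# Hypothetical mapping of customer location to permitted deployment regions.
ALLOWED_REGIONS = {
    "EU": {"eu-west-1", "eu-central-1"},
    "US": {"us-east-1", "us-west-2"},
}
RETENTION_YEARS = 7  # from the archival requirement above


def region_allowed(customer_location, region):
    """True if deploying to region satisfies the residency rule."""
    allowed = ALLOWED_REGIONS.get(customer_location)
    if allowed is None:
        # Unknown locations fail closed rather than deploying anywhere.
        raise ValueError(f"no residency policy for {customer_location!r}")
    return region in allowed


def retention_end(archived_on, years=RETENTION_YEARS):
    """Earliest date on which an archived artifact may be deleted."""
    try:
        return archived_on.replace(year=archived_on.year + years)
    except ValueError:
        # Feb 29 with a non-leap target year: round up to Mar 1 so the
        # retention period is never shorter than required.
        return date(archived_on.year + years, 3, 1)


if __name__ == "__main__":
    print(region_allowed("EU", "us-east-1"))
    print(retention_end(date(2024, 5, 10)))
```

A CI/CD stage could call `region_allowed` before any deployment step and abort when it returns `False` or raises.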