This curriculum outlines a multi-workshop reliability engineering program built around the practices used in enterprise-scale availability governance, from SLO negotiation and deployment controls through disaster recovery validation and chaos engineering.
Module 1: Defining Availability Requirements and SLIs/SLOs
- Selecting appropriate service level indicators (SLIs) such as request latency, error rate, or throughput based on user-facing impact.
- Negotiating SLO thresholds with product and operations teams while accounting for technical debt and legacy dependencies.
- Determining burn rate policies for error budget consumption and defining alerting triggers accordingly.
- Mapping user journeys to backend services to identify critical paths influencing availability.
- Documenting exceptions for scheduled maintenance windows in SLO calculations to avoid false violations.
- Aligning SLO definitions across multi-region deployments where regional outages may not impact global availability.
- Implementing synthetic monitoring to simulate user transactions and validate SLI accuracy.
- Handling discrepancies between infrastructure-level metrics (e.g., CPU) and service-level availability.
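The burn-rate policy above can be sketched in a few lines. This is a minimal illustration, not a standard API: the function names and the 30-day (720-hour) budget window are assumptions for the example.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / allowed error rate.

    A burn rate of 1.0 consumes the error budget exactly over the SLO
    window; 10.0 consumes it ten times faster.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be strictly below 100%")
    return error_rate / budget


def budget_exhaustion_hours(burn: float, window_hours: float = 720.0) -> float:
    """Hours until a 30-day error budget is fully consumed at this burn rate."""
    if burn <= 0:
        return float("inf")
    return window_hours / burn
```

For a 99.9% SLO, a sustained 1% error rate is a burn rate of 10, which exhausts a 30-day budget in about three days; an alerting trigger would typically fire well before that.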
Module 2: Change Management and Deployment Controls
- Enforcing mandatory peer review gates for production configuration changes in version-controlled infrastructure.
- Implementing time-based deployment freezes during peak business periods or critical events.
- Configuring automated rollback triggers based on health check failures or SLO breaches post-deployment.
- Integrating deployment pipelines with incident management systems to prevent releases during active outages.
- Requiring pre-deployment dependency impact analysis for shared services and databases.
- Managing exceptions for emergency fixes while maintaining audit trails and post-mortem requirements.
- Enforcing canary analysis duration based on traffic volume and error signal stabilization.
- Restricting direct production access through bastion hosts or Just-In-Time (JIT) elevation workflows.
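A time-based freeze gate with an emergency bypass, as described above, reduces to a date-range check. The freeze windows below are illustrative placeholders, not real calendar policy:

```python
from datetime import date

# Hypothetical freeze windows (inclusive date ranges); real windows are
# set per organization around peak business periods.
FREEZE_WINDOWS = [
    (date(2024, 11, 25), date(2024, 12, 2)),   # peak shopping week
    (date(2024, 12, 23), date(2025, 1, 2)),    # end-of-year freeze
]


def deployment_allowed(when: date, emergency: bool = False) -> bool:
    """Block routine deployments inside a freeze window.

    Emergency fixes bypass the freeze; a real pipeline would also write
    an audit-trail entry and open a post-mortem requirement here.
    """
    if emergency:
        return True
    return not any(start <= when <= end for start, end in FREEZE_WINDOWS)
```

Keeping the bypass explicit (rather than disabling the gate) preserves the audit trail the module calls for.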
Module 3: System Decomposition and Dependency Governance
- Identifying and cataloging transitive dependencies that introduce hidden availability risks.
- Enforcing circuit breaker patterns in service clients to prevent cascading failures during downstream outages.
- Negotiating SLAs with third-party providers and aligning internal SLOs accordingly.
- Implementing dependency health dashboards that aggregate status across internal and external services.
- Deciding between synchronous and asynchronous integration patterns based on availability requirements.
- Enforcing version pinning or semantic versioning policies to avoid unexpected breaking changes.
- Conducting dependency impact assessments before decommissioning shared platforms.
- Managing shared database schemas across teams to prevent uncoordinated breaking changes.
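The circuit breaker pattern mentioned above can be sketched as a small state machine: closed → open after consecutive failures, half-open after a cooldown, closed again on success. The thresholds and class shape here are illustrative, not a specific library's API:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    half-opens after a cooldown, and closes on the next success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while open is what stops a downstream outage from consuming the caller's threads and cascading upstream.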
Module 4: Automated Monitoring and Alerting Strategy
- Reducing alert fatigue by applying signal-to-noise filtering using error budgets and burn rates.
- Designing alerting rules based on symptoms (e.g., latency, errors) rather than causes (e.g., CPU).
- Implementing dynamic thresholds for metrics that vary by time of day or business cycle.
- Validating alert routing paths during team on-call rotations and escalation changes.
- Suppressing non-actionable alerts during planned maintenance or known outages.
- Correlating alerts across services to detect systemic issues versus isolated incidents.
- Ensuring monitoring agents are deployed with high availability and self-health checks.
- Managing retention policies for time-series data based on incident investigation needs.
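The burn-rate filtering above is often implemented as a multi-window rule: page only when both a long and a short window exceed a threshold, so a brief spike alone does not wake anyone. The window pairs and thresholds below follow commonly cited values for a 99.9% 30-day SLO and would be tuned per service:

```python
# (long_window, short_window, burn-rate threshold) -- illustrative values
PAGE_RULES = [
    ("1h", "5m", 14.4),    # ~2% of a 30-day budget burned in 1h
    ("6h", "30m", 6.0),    # ~5% of a 30-day budget burned in 6h
]


def should_page(burn_by_window: dict) -> bool:
    """Page only when BOTH windows of some rule exceed the threshold.

    The short window confirms the problem is still happening; the long
    window confirms it is significant. Missing windows count as zero.
    """
    for long_w, short_w, threshold in PAGE_RULES:
        if (burn_by_window.get(long_w, 0.0) >= threshold
                and burn_by_window.get(short_w, 0.0) >= threshold):
            return True
    return False
```

Because burn rate is derived from the SLI itself, this is a symptom-based alert: it fires on user-visible errors, not on CPU.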
Module 5: Incident Response and On-Call Operations
- Defining escalation paths for incidents that exceed team resolution capabilities or SLAs.
- Conducting blameless post-mortems with mandatory action item tracking and follow-up deadlines.
- Standardizing incident communication templates for internal stakeholders and customer-facing teams.
- Rotating on-call responsibilities with adequate ramp-up periods and shadowing requirements.
- Implementing war room coordination protocols for cross-team incidents.
- Validating incident response runbooks through periodic fire drills and simulation exercises.
- Integrating incident timelines with monitoring and deployment data for root cause analysis.
- Managing fatigue risk by enforcing maximum on-call duration and compensatory time off.
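The escalation paths above amount to an ordered chain per severity: walk it until someone acknowledges. The severity labels and responder names here are hypothetical placeholders:

```python
# Hypothetical escalation chains keyed by severity; names are illustrative.
SEVERITY_ESCALATION = {
    "sev1": ["primary-oncall", "secondary-oncall",
             "eng-manager", "incident-commander"],
    "sev2": ["primary-oncall", "secondary-oncall"],
    "sev3": ["primary-oncall"],
}


def next_responder(severity, acknowledged):
    """Return the next responder in the chain who has not yet acknowledged,
    or None when the chain for that severity is exhausted."""
    for responder in SEVERITY_ESCALATION.get(severity, []):
        if responder not in acknowledged:
            return responder
    return None
```

An exhausted chain (None) is itself a signal: the incident has exceeded the team's resolution capability and needs an out-of-band escalation.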
Module 6: Disaster Recovery and Failover Planning
- Classifying systems by recovery time objective (RTO) and recovery point objective (RPO) tiers.
- Validating failover procedures for stateful services such as databases and message queues.
- Managing DNS failover configurations with appropriate TTL settings and health checks.
- Replicating configuration secrets and credentials across regions using secure vault solutions.
- Conducting scheduled failover drills with rollback validation and performance benchmarking.
- Handling data consistency issues during partial or asymmetric regional outages.
- Documenting manual intervention steps required when automated failover fails or is unsafe.
- Ensuring backup retention policies support compliance and forensic recovery needs.
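Tier classification by RTO/RPO, as in the first bullet above, is a lookup against tier bounds ordered strictest-first. The tier names and minute values below are invented for illustration; real tiers are organization-specific:

```python
# Hypothetical tiers, strictest first: (name, max RTO min, max RPO min)
TIERS = [
    ("tier0", 15, 0),       # mission-critical: near-zero data loss
    ("tier1", 60, 15),
    ("tier2", 240, 60),
    ("tier3", 1440, 240),
]


def classify(required_rto_min: float, required_rpo_min: float) -> str:
    """Assign the strictest tier whose RTO and RPO bounds both cover
    the system's requirements; flag anything stricter than tier0's
    bounds... er, looser than tier3's, as untiered for manual review."""
    for name, rto_bound, rpo_bound in TIERS:
        if required_rto_min <= rto_bound and required_rpo_min <= rpo_bound:
            return name
    return "untiered"
```

The tier then drives everything downstream in this module: drill frequency, replication topology, and backup retention.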
Module 7: Capacity Planning and Load Management
- Forecasting capacity needs based on historical growth trends and upcoming product launches.
- Implementing autoscaling policies with cooldown periods to prevent thrashing.
- Setting up load shedding mechanisms to reject non-critical traffic during overload conditions.
- Conducting load testing under realistic traffic patterns, including spike and sustained loads.
- Managing resource quotas and limits in multi-tenant environments to prevent noisy neighbors.
- Monitoring queue depths and backpressure signals in asynchronous processing pipelines.
- Planning for cold start scenarios in serverless environments during traffic surges.
- Right-sizing instance types based on memory, CPU, and I/O bottlenecks observed in production.
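A target-tracking autoscaler with a cooldown, per the second bullet above, can be sketched as follows. The target utilization, cooldown, and bounds are illustrative defaults, not a specific cloud provider's API:

```python
import math


class Autoscaler:
    """Target-tracking autoscaler sketch: size the fleet so utilization
    lands near a target, but hold steady inside a cooldown window to
    prevent thrashing (rapid scale-up/scale-down oscillation)."""

    def __init__(self, target_util=0.5, cooldown_s=300.0, min_n=2, max_n=50):
        self.target_util = target_util
        self.cooldown_s = cooldown_s
        self.min_n = min_n
        self.max_n = max_n
        self.last_change = float("-inf")

    def desired(self, current_n, utilization, now):
        """Return the new instance count for the observed utilization."""
        want = math.ceil(current_n * utilization / self.target_util)
        want = max(self.min_n, min(self.max_n, want))
        if want == current_n or now - self.last_change < self.cooldown_s:
            return current_n          # inside cooldown: no change
        self.last_change = now
        return want
```

The cooldown trades reaction speed for stability; load shedding (the next bullet) covers the gap while a scale-up is still in flight.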
Module 8: Configuration and Drift Management
- Enforcing immutable infrastructure patterns to eliminate configuration drift in production.
- Implementing continuous configuration compliance checks using policy-as-code tools.
- Managing environment-specific configurations through secure parameter stores or config servers.
- Tracking configuration changes through audit logs and linking them to change requests.
- Handling emergency configuration overrides with automatic expiration and notification.
- Standardizing base images and OS patch levels across fleets to reduce variability.
- Validating configuration templates against schema and security policies pre-deployment.
- Reconciling configuration differences between staging and production environments.
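The compliance checks and staging/production reconciliation above both reduce to a desired-vs-actual diff. A minimal sketch, assuming configurations are flattened to key/value dictionaries:

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Compare actual configuration against the desired (version-controlled)
    state and report keys that are missing, unexpected, or changed."""
    drift = {"missing": [], "unexpected": [], "changed": []}
    for key, value in desired.items():
        if key not in actual:
            drift["missing"].append(key)
        elif actual[key] != value:
            drift["changed"].append(key)
    drift["unexpected"] = sorted(k for k in actual if k not in desired)
    return drift


def is_compliant(desired: dict, actual: dict) -> bool:
    """True when no drift of any kind is present."""
    return not any(detect_drift(desired, actual).values())
```

Run continuously, this is the core loop of a policy-as-code check; any non-empty bucket links back to an audit-log entry or an emergency override awaiting expiration.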
Module 9: Availability Testing and Resilience Validation
- Designing chaos engineering experiments that target specific failure modes without violating SLOs.
- Scheduling resilience tests during low-traffic periods with rollback and monitoring safeguards.
- Injecting network latency and packet loss to validate timeout and retry logic in clients.
- Testing state recovery procedures after simulated node or zone failures.
- Measuring recovery time from backup restores under realistic data volume conditions.
- Validating circuit breaker state transitions and fallback behavior under load.
- Coordinating cross-team resilience tests involving shared platforms and dependencies.
- Documenting test outcomes and updating runbooks or architecture based on findings.
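Validating retry logic under injected faults, as the module describes, can be exercised with a small fault-injection wrapper. The wrapper and retry helper below are illustrative test scaffolding, not a chaos-engineering framework's API; the RNG is injectable so experiments stay deterministic:

```python
import random


def with_fault_injection(fn, failure_rate=0.1, rng=None):
    """Wrap a callable so a fraction of calls raise ConnectionError,
    simulating a flaky dependency for resilience testing."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)

    return wrapped


def retry(fn, attempts=3):
    """Bounded retry under test: the experiment checks that the client
    survives injected faults without unbounded retries."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError as err:
            last_error = err
    raise last_error
```

In a real experiment the wrapper would sit at a network boundary and the pass criterion would be an SLO-level one: injected faults at rate X must not push the user-visible error rate past the budget.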