This curriculum spans a multi-workshop reliability engineering program, covering the technical, operational, and organizational practices needed to conduct root cause analyses (RCAs) that inform system design, incident response, and cross-team coordination in production environments.
Module 1: Defining Availability Requirements and SLIs
- Selecting appropriate service level indicators (SLIs) such as request success rate, latency thresholds, or task completion status based on business-critical workflows.
- Negotiating SLI definitions with product and operations teams to ensure alignment between technical feasibility and customer expectations.
- Implementing instrumentation to capture user-impacting metrics at the edge rather than relying solely on internal system health checks.
- Deciding between count-based and duration-based availability measurements for batch processing systems with intermittent workloads.
- Handling ambiguous user journeys by defining synthetic transaction paths to represent real user behavior for monitoring.
- Managing the trade-off between precision in SLI calculation and performance overhead from high-cardinality metric collection.
- Documenting edge cases where SLIs may not reflect actual user experience, such as partial functionality behind feature flags.
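The count-based versus duration-based distinction above can be sketched in a few lines. This is a minimal illustration, not a production SLI pipeline; the function names and sample numbers are assumptions chosen for clarity.

```python
def count_based_availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded (event-based SLI)."""
    return successes / total if total else 1.0


def duration_based_availability(downtime_minutes: float, window_minutes: float) -> float:
    """Fraction of the measurement window during which the service was up."""
    return 1.0 - downtime_minutes / window_minutes


# Count-based: 9,990 of 10,000 requests succeeded -> 0.999
request_sli = count_based_availability(9_990, 10_000)

# Duration-based: 43 minutes of downtime in a 30-day window
uptime_sli = duration_based_availability(43, 30 * 24 * 60)
```

For batch systems with intermittent workloads, the count-based form is often the better fit: an idle system accrues no "downtime" yet may still be failing every task it receives.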
Module 2: Instrumentation and Observability Architecture
- Designing distributed tracing pipelines to correlate frontend requests with backend service calls across microservices.
- Choosing between agent-based and code-instrumented telemetry collection based on runtime constraints and team ownership models.
- Implementing structured logging with consistent schema enforcement across polyglot services to enable reliable parsing.
- Configuring sampling strategies for traces to balance storage costs with the ability to reconstruct failure scenarios.
- Integrating business event streams (e.g., order submission, login attempts) into observability platforms for user-centric analysis.
- Establishing data retention policies for logs, metrics, and traces that support RCA while complying with data governance requirements.
- Validating end-to-end signal propagation during deployments to ensure telemetry continues across service boundaries.
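Schema enforcement for structured logging can be as simple as rejecting records that miss required fields before serialization. The sketch below assumes an illustrative four-field schema; real deployments would derive the schema from a shared contract across services.

```python
import json
import time

# Illustrative schema; in practice this would come from a shared, versioned contract.
REQUIRED_FIELDS = {"service", "level", "message", "trace_id"}


def emit_log(record: dict) -> str:
    """Serialize a log record as JSON, rejecting records that violate the schema."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"log record missing fields: {sorted(missing)}")
    record.setdefault("ts", time.time())  # stamp if the caller did not
    return json.dumps(record, sort_keys=True)


line = emit_log({
    "service": "checkout",
    "level": "ERROR",
    "message": "payment timeout",
    "trace_id": "abc123",
})
```

Enforcing the schema at emission time keeps downstream parsing reliable even across polyglot services, because malformed records fail loudly at the source rather than silently breaking queries during an RCA.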
Module 3: Incident Detection and Alerting Strategy
- Setting dynamic thresholds for availability alerts using historical baselines instead of static values to reduce false positives.
- Designing alerting rules that trigger on user-impacting conditions rather than infrastructure-level anomalies.
- Implementing alert muting and routing policies to prevent notification fatigue during planned maintenance windows.
- Configuring multi-dimensional alert aggregation to avoid duplication across shards or regions.
- Integrating alert suppression mechanisms for known issues tracked in incident management systems.
- Validating alert fidelity by conducting periodic alert postmortems to identify missed or spurious detections.
- Coordinating on-call rotations with alert ownership to ensure alerts route to teams with remediation authority.
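One common way to derive a dynamic threshold from a historical baseline is mean plus N standard deviations. The sketch below uses that model with hypothetical error-rate samples; real alerting systems typically add seasonality handling on top of this idea.

```python
import statistics


def dynamic_threshold(baseline: list, sigmas: float = 3.0) -> float:
    """Threshold = historical mean plus `sigmas` standard deviations."""
    return statistics.mean(baseline) + sigmas * statistics.pstdev(baseline)


def should_alert(current: float, baseline: list) -> bool:
    """Fire only when the current value exceeds the baseline-derived threshold."""
    return current > dynamic_threshold(baseline)


# Hypothetical recent error rates (fraction of failed requests)
baseline = [0.010, 0.012, 0.009, 0.011, 0.010]

spike = should_alert(0.05, baseline)    # well above baseline -> alert
normal = should_alert(0.012, baseline)  # within normal variation -> no alert
```

Compared with a static threshold, this keeps a noisy-but-healthy service quiet while still catching genuine deviations, which directly reduces false positives.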
Module 4: Structured Root Cause Analysis Methodology
- Selecting between timeline-based, fault tree, and fishbone analysis based on incident complexity and available data.
- Establishing a standardized incident timeline with precise timestamps from logs, traces, and monitoring systems.
- Isolating contributing factors by analyzing dependencies during the incident window using service dependency graphs.
- Conducting blameless data reviews by focusing on process gaps rather than individual actions.
- Using hypothesis-driven investigation to prioritize potential causes based on likelihood and impact.
- Documenting all discarded hypotheses with evidence to prevent recurrence of incorrect assumptions.
- Integrating third-party service status data into RCA timelines when external dependencies are involved.
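Building the standardized timeline is mechanically a merge-and-sort over timestamped events from every signal source. A minimal sketch, with hypothetical event data:

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, source, event) tuples from all sources into one
    chronologically ordered incident timeline."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


# Hypothetical signals pulled from three systems during the incident window
traces = [(datetime(2024, 5, 1, 12, 1), "traces", "latency rises on db calls")]
logs = [(datetime(2024, 5, 1, 12, 3), "logs", "5xx spike begins")]
alerts = [(datetime(2024, 5, 1, 12, 5), "alerts", "availability page fired")]

timeline = build_timeline(logs, traces, alerts)
```

Laying the sources side by side this way often exposes the ordering that matters for causality: here the trace-level latency rise precedes both the error spike and the page.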
Module 5: Dependency and Cascading Failure Analysis
- Mapping synchronous versus asynchronous dependencies to assess blast radius during partial outages.
- Identifying hidden dependencies through runtime tracing rather than relying on documentation or architecture diagrams.
- Implementing circuit breakers and bulkheads based on observed failure propagation patterns from past incidents.
- Quantifying retry storm amplification by analyzing request multiplier effects during dependency degradation.
- Assessing the impact of degraded responses (e.g., timeouts, partial data) versus complete failures on end-user availability.
- Reconstructing queue backlogs in asynchronous systems to determine saturation points during cascading failures.
- Coordinating cross-team RCAs when failures originate in shared platform components.
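Retry storm amplification can be quantified with a simple geometric model: if each attempt fails independently with probability p and is retried up to r times, the expected number of attempts per logical request is 1 + p + p² + … + pʳ. A sketch under those (simplifying) independence assumptions:

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected attempts per logical request when each attempt fails
    independently with `failure_prob` and triggers a retry, up to `max_retries`."""
    # One initial attempt, plus a retry after each failed attempt (capped)
    return sum(failure_prob ** k for k in range(max_retries + 1))


healthy = expected_attempts(0.0, 3)   # 1.0: no amplification
degraded = expected_attempts(0.9, 3)  # ~3.44x request multiplier
```

That ~3.4x multiplier compounds across each hop in a call chain, which is why a mildly degraded dependency several layers deep can saturate upstream services; jittered backoff and retry budgets exist to cap exactly this effect.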
Module 6: Change and Deployment Correlation
- Linking deployment timelines with availability dips using automated change data capture from CI/CD systems.
- Distinguishing between rollout-related failures and coincidental timing using canary analysis and traffic segmentation.
- Implementing pre-deployment availability checks to validate health endpoints and critical workflows before full release.
- Enforcing deployment freeze policies during high-risk business periods based on historical incident data.
- Using feature flag telemetry to isolate functionality changes from deployment events during RCA.
- Reconstructing configuration drift across environments when infrastructure-as-code is inconsistently applied.
- Correlating third-party library updates with memory leak or latency regressions observed in production.
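A first-pass change correlation is just a time-window join between deploy events and availability dips. The sketch below pairs each dip with any deploy in the preceding window; the 30-minute window and sample timestamps are assumptions, and real analysis would follow up with canary comparison to rule out coincidence.

```python
from datetime import datetime, timedelta


def correlate(dips, deploys, window=timedelta(minutes=30)):
    """Map each availability dip to the deploys that preceded it within `window`."""
    return {
        dip: [d for d in deploys if timedelta(0) <= dip - d <= window]
        for dip in dips
    }


# Hypothetical change data captured from a CI/CD system
deploys = [datetime(2024, 5, 1, 12, 0)]
dips = [datetime(2024, 5, 1, 12, 10), datetime(2024, 5, 1, 15, 0)]

matches = correlate(dips, deploys)
# The 12:10 dip has a candidate deploy; the 15:00 dip has none
```

A dip with no candidate deploy pushes the investigation toward other change sources: feature flag flips, configuration drift, or third-party updates, as the bullets above describe.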
Module 7: Human and Process Factors in Availability
- Analyzing incident response delays due to unclear escalation paths or missing runbook procedures.
- Reviewing communication breakdowns in war room coordination using recorded bridge logs and chat transcripts.
- Assessing cognitive load during incidents by evaluating the number of systems an engineer must monitor simultaneously.
- Identifying knowledge silos by mapping incident resolution to individual contributors across past outages.
- Measuring mean time to acknowledge (MTTA) and mean time to mitigate (MTTM) to benchmark team responsiveness.
- Integrating post-incident training into onboarding based on recurring human error patterns.
- Designing runbooks with decision trees that reflect actual troubleshooting workflows, not idealized procedures.
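MTTA and MTTM fall directly out of per-incident timestamps. A minimal sketch with hypothetical incident records; production benchmarking would pull these fields from the incident management system:

```python
from datetime import datetime


def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, expressed in minutes."""
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


# Hypothetical incident records with detection, acknowledgement, mitigation times
incidents = [
    {"detected": datetime(2024, 5, 1, 12, 0),
     "acknowledged": datetime(2024, 5, 1, 12, 4),
     "mitigated": datetime(2024, 5, 1, 12, 40)},
    {"detected": datetime(2024, 5, 3, 9, 0),
     "acknowledged": datetime(2024, 5, 3, 9, 6),
     "mitigated": datetime(2024, 5, 3, 9, 50)},
]

mtta = mean_minutes(i["acknowledged"] - i["detected"] for i in incidents)
mttm = mean_minutes(i["mitigated"] - i["detected"] for i in incidents)
```

Tracking the two separately matters: a high MTTA with a low post-acknowledgement gap points at escalation-path problems, while the reverse points at runbook or tooling gaps.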
Module 8: Remediation Planning and Verification
- Prioritizing remediation tasks based on recurrence likelihood and potential impact using risk scoring models.
- Defining measurable success criteria for fixes, such as reduction in error budget consumption or alert frequency.
- Implementing automated verification tests that validate fixes in staging environments before production rollout.
- Tracking remediation completion through integration with ticketing systems and sprint planning tools.
- Conducting follow-up reviews 30–60 days after implementation to assess long-term effectiveness.
- Updating monitoring dashboards and alerting rules to reflect new failure modes post-remediation.
- Adjusting error budget policies based on updated reliability targets after architectural changes.
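The risk-scoring idea above can be sketched with a simple multiplicative model, recurrence likelihood times impact. The weights, task names, and scales below are illustrative assumptions; real models often add factors such as remediation cost.

```python
def risk_score(task: dict) -> float:
    """Multiplicative risk model: recurrence likelihood (0-1) x impact (1-10)."""
    return task["likelihood"] * task["impact"]


# Hypothetical remediation backlog from a recent RCA
tasks = [
    {"name": "add db connection pool limits", "likelihood": 0.7, "impact": 8},
    {"name": "fix flaky health check", "likelihood": 0.9, "impact": 3},
    {"name": "harden retry policy", "likelihood": 0.5, "impact": 9},
]

prioritized = sorted(tasks, key=risk_score, reverse=True)
```

Even this crude model surfaces a useful insight: the most likely recurrence (the flaky health check) ranks last because its impact is low, which is exactly the prioritization mistake gut-feel triage tends to make in the other direction.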
Module 9: Organizational Learning and Feedback Loops
- Standardizing incident report templates to ensure consistent data capture across teams and time.
- Integrating RCA findings into architecture review boards to influence future system design decisions.
- Creating targeted reliability training modules based on recurring root causes across incidents.
- Generating executive summaries that translate technical findings into business risk and cost implications.
- Using trend analysis to identify systemic issues, such as recurring configuration errors or testing gaps.
- Establishing feedback mechanisms from SRE teams to product managers for reliability requirement refinement.
- Archiving RCA artifacts in searchable knowledge bases with metadata tagging for future reference.
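Trend analysis over archived RCAs reduces, at its simplest, to counting root-cause tags across reports. A minimal sketch, assuming each archived report carries the metadata tags described above (the tag vocabulary here is hypothetical):

```python
from collections import Counter


def recurring_causes(reports, min_count: int = 2):
    """Count root-cause tags across incident reports and return those that
    recur at least `min_count` times, most frequent first."""
    counts = Counter(tag for report in reports for tag in report["tags"])
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Hypothetical archived RCA reports with metadata tags
reports = [
    {"id": "INC-101", "tags": ["config-error", "missing-test"]},
    {"id": "INC-115", "tags": ["config-error"]},
    {"id": "INC-120", "tags": ["dependency-timeout", "config-error"]},
]

systemic = recurring_causes(reports)  # [("config-error", 3)]
```

Counts like these are what turn individual postmortems into organizational signal: a tag that recurs across unrelated incidents is a candidate for the architecture review board or a targeted training module.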