This curriculum spans a multi-workshop reliability engineering program, covering the technical, operational, and organizational practices needed to conduct root cause analyses (RCAs) that inform system design, incident response, and cross-team coordination in production environments.
Module 1: Defining Availability Requirements and SLIs
- Selecting appropriate service level indicators (SLIs) such as request success rate, latency thresholds, or task completion status based on business-critical workflows.
- Negotiating SLI definitions with product and operations teams to ensure alignment between technical feasibility and customer expectations.
- Implementing instrumentation to capture user-impacting metrics at the edge rather than relying solely on internal system health checks.
- Deciding between count-based and duration-based availability measurements for batch processing systems with intermittent workloads.
- Handling ambiguous user journeys by defining synthetic transaction paths to represent real user behavior for monitoring.
- Managing the trade-off between precision in SLI calculation and performance overhead from high-cardinality metric collection.
- Documenting edge cases where SLIs may not reflect actual user experience, such as partial functionality behind feature flags.
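The count-based versus duration-based distinction above can be sketched in a few lines. This is a minimal illustration, not a production SLI pipeline; the function names and sample numbers are assumptions chosen for clarity.

```python
def count_based_availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded (event-based SLI)."""
    return successes / total if total else 1.0


def duration_based_availability(downtime_minutes: float, window_minutes: float) -> float:
    """Fraction of the measurement window during which the service was up."""
    return 1.0 - downtime_minutes / window_minutes


# Count-based: 9,990 of 10,000 requests succeeded -> 0.999
request_sli = count_based_availability(9_990, 10_000)

# Duration-based: 43 minutes of downtime in a 30-day window
uptime_sli = duration_based_availability(43, 30 * 24 * 60)
```

For batch systems with intermittent workloads, the count-based form is often the better fit: an idle system accrues no "downtime" yet may still be failing every task it receives.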
Module 2: Instrumentation and Observability Architecture
- Designing distributed tracing pipelines to correlate frontend requests with backend service calls across microservices.
- Choosing between agent-based and code-instrumented telemetry collection based on runtime constraints and team ownership models.
- Implementing structured logging with consistent schema enforcement across polyglot services to enable reliable parsing.
- Configuring sampling strategies for traces to balance storage costs with the ability to reconstruct failure scenarios.
- Integrating business event streams (e.g., order submission, login attempts) into observability platforms for user-centric analysis.
- Establishing data retention policies for logs, metrics, and traces that support RCA while complying with data governance requirements.
- Validating end-to-end signal propagation during deployments to ensure telemetry continues across service boundaries.
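Schema enforcement for structured logging can be as simple as rejecting records that miss required fields before serialization. The sketch below assumes an illustrative four-field schema; real deployments would derive the schema from a shared contract across services.

```python
import json
import time

# Illustrative schema; in practice this would come from a shared, versioned contract.
REQUIRED_FIELDS = {"service", "level", "message", "trace_id"}


def emit_log(record: dict) -> str:
    """Serialize a log record as JSON, rejecting records that violate the schema."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"log record missing fields: {sorted(missing)}")
    record.setdefault("ts", time.time())  # stamp if the caller did not
    return json.dumps(record, sort_keys=True)


line = emit_log({
    "service": "checkout",
    "level": "ERROR",
    "message": "payment timeout",
    "trace_id": "abc123",
})
```

Enforcing the schema at emission time keeps downstream parsing reliable even across polyglot services, because malformed records fail loudly at the source rather than silently breaking queries during an RCA.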
Module 3: Incident Detection and Alerting Strategy
- Setting dynamic thresholds for availability alerts using historical baselines instead of static values to reduce false positives.
- Designing alerting rules that trigger on user-impacting conditions rather than infrastructure-level anomalies.
- Implementing alert muting and routing policies to prevent notification fatigue during planned maintenance windows.
- Configuring multi-dimensional alert aggregation to avoid duplication across shards or regions.
- Integrating alert suppression mechanisms for known issues tracked in incident management systems.
- Validating alert fidelity by conducting periodic alert postmortems to identify missed or spurious detections.
- Coordinating on-call rotations with alert ownership to ensure alerts route to teams with remediation authority.
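One common way to derive a dynamic threshold from a historical baseline is mean plus N standard deviations. The sketch below uses that model with hypothetical error-rate samples; real alerting systems typically add seasonality handling on top of this idea.

```python
import statistics


def dynamic_threshold(baseline: list, sigmas: float = 3.0) -> float:
    """Threshold = historical mean plus `sigmas` standard deviations."""
    return statistics.mean(baseline) + sigmas * statistics.pstdev(baseline)


def should_alert(current: float, baseline: list) -> bool:
    """Fire only when the current value exceeds the baseline-derived threshold."""
    return current > dynamic_threshold(baseline)


# Hypothetical recent error rates (fraction of failed requests)
baseline = [0.010, 0.012, 0.009, 0.011, 0.010]

spike = should_alert(0.05, baseline)    # well above baseline -> alert
normal = should_alert(0.012, baseline)  # within normal variation -> no alert
```

Compared with a static threshold, this keeps a noisy-but-healthy service quiet while still catching genuine deviations, which directly reduces false positives.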
Module 4: Structured Root Cause Analysis Methodology
- Selecting between timeline-based, fault tree, and fishbone analysis based on incident complexity and available data.
- Establishing a standardized incident timeline with precise timestamps from logs, traces, and monitoring systems.
- Isolating contributing factors by analyzing dependencies during the incident window using service dependency graphs.
- Conducting blameless data reviews by focusing on process gaps rather than individual actions.
- Using hypothesis-driven investigation to prioritize potential causes based on likelihood and impact.
- Documenting all discarded hypotheses with evidence to prevent recurrence of incorrect assumptions.
- Integrating third-party service status data into RCA timelines when external dependencies are involved.
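Building the standardized timeline is mechanically a merge-and-sort over timestamped events from every signal source. A minimal sketch, with hypothetical event data:

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, source, event) tuples from all sources into one
    chronologically ordered incident timeline."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


# Hypothetical signals pulled from three systems during the incident window
traces = [(datetime(2024, 5, 1, 12, 1), "traces", "latency rises on db calls")]
logs = [(datetime(2024, 5, 1, 12, 3), "logs", "5xx spike begins")]
alerts = [(datetime(2024, 5, 1, 12, 5), "alerts", "availability page fired")]

timeline = build_timeline(logs, traces, alerts)
```

Laying the sources side by side this way often exposes the ordering that matters for causality: here the trace-level latency rise precedes both the error spike and the page.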
Module 5: Dependency and Cascading Failure Analysis
- Mapping synchronous versus asynchronous dependencies to assess blast radius during partial outages.
- Identifying hidden dependencies through runtime tracing rather than relying on documentation or architecture diagrams.
- Implementing circuit breakers and bulkheads based on observed failure propagation patterns from past incidents.
- Quantifying retry storm amplification by analyzing request multiplier effects during dependency degradation.
- Assessing the impact of degraded responses (e.g., timeouts, partial data) versus complete failures on end-user availability.
- Reconstructing queue backlogs in asynchronous systems to determine saturation points during cascading failures.
- Coordinating cross-team RCAs when failures originate in shared platform components.
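Retry storm amplification can be quantified with a simple geometric model: if each attempt fails independently with probability p and is retried up to r times, the expected number of attempts per logical request is 1 + p + p² + … + pʳ. A sketch under those (simplifying) independence assumptions:

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected attempts per logical request when each attempt fails
    independently with `failure_prob` and triggers a retry, up to `max_retries`."""
    # One initial attempt, plus a retry after each failed attempt (capped)
    return sum(failure_prob ** k for k in range(max_retries + 1))


healthy = expected_attempts(0.0, 3)   # 1.0: no amplification
degraded = expected_attempts(0.9, 3)  # ~3.44x request multiplier
```

That ~3.4x multiplier compounds across each hop in a call chain, which is why a mildly degraded dependency several layers deep can saturate upstream services; jittered backoff and retry budgets exist to cap exactly this effect.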
Module 6: Change and Deployment Correlation
- Linking deployment timelines with availability dips using automated change data capture from CI/CD systems.
- Distinguishing between rollout-related failures and coincidental timing using canary analysis and traffic segmentation.
- Implementing pre-deployment availability checks to validate health endpoints and critical workflows before full release.
- Enforcing deployment freeze policies during high-risk business periods based on historical incident data.
- Using feature flag telemetry to isolate functionality changes from deployment events during RCA.
- Reconstructing configuration drift across environments when infrastructure-as-code is inconsistently applied.
- Correlating third-party library updates with memory leak or latency regressions observed in production.
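A first-pass change correlation is just a time-window join between deploy events and availability dips. The sketch below pairs each dip with any deploy in the preceding window; the 30-minute window and sample timestamps are assumptions, and real analysis would follow up with canary comparison to rule out coincidence.

```python
from datetime import datetime, timedelta


def correlate(dips, deploys, window=timedelta(minutes=30)):
    """Map each availability dip to the deploys that preceded it within `window`."""
    return {
        dip: [d for d in deploys if timedelta(0) <= dip - d <= window]
        for dip in dips
    }


# Hypothetical change data captured from a CI/CD system
deploys = [datetime(2024, 5, 1, 12, 0)]
dips = [datetime(2024, 5, 1, 12, 10), datetime(2024, 5, 1, 15, 0)]

matches = correlate(dips, deploys)
# The 12:10 dip has a candidate deploy; the 15:00 dip has none
```

A dip with no candidate deploy pushes the investigation toward other change sources: feature flag flips, configuration drift, or third-party updates, as the bullets above describe.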
Module 7: Human and Process Factors in Availability
- Analyzing incident response delays due to unclear escalation paths or missing runbook procedures.
- Reviewing communication breakdowns in war room coordination using recorded bridge logs and chat transcripts.
- Assessing cognitive load during incidents by evaluating the number of systems an engineer must monitor simultaneously.
- Identifying knowledge silos by mapping incident resolution to individual contributors across past outages.
- Measuring mean time to acknowledge (MTTA) and mean time to mitigate (MTTM) to benchmark team responsiveness.
- Integrating post-incident training into onboarding based on recurring human error patterns.
- Designing runbooks with decision trees that reflect actual troubleshooting workflows, not idealized procedures.
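MTTA and MTTM fall directly out of per-incident timestamps. A minimal sketch with hypothetical incident records; production benchmarking would pull these fields from the incident management system:

```python
from datetime import datetime


def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, expressed in minutes."""
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


# Hypothetical incident records with detection, acknowledgement, mitigation times
incidents = [
    {"detected": datetime(2024, 5, 1, 12, 0),
     "acknowledged": datetime(2024, 5, 1, 12, 4),
     "mitigated": datetime(2024, 5, 1, 12, 40)},
    {"detected": datetime(2024, 5, 3, 9, 0),
     "acknowledged": datetime(2024, 5, 3, 9, 6),
     "mitigated": datetime(2024, 5, 3, 9, 50)},
]

mtta = mean_minutes(i["acknowledged"] - i["detected"] for i in incidents)
mttm = mean_minutes(i["mitigated"] - i["detected"] for i in incidents)
```

Tracking the two separately matters: a high MTTA with a low post-acknowledgement gap points at escalation-path problems, while the reverse points at runbook or tooling gaps.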
Module 8: Remediation Planning and Verification
- Prioritizing remediation tasks based on recurrence likelihood and potential impact using risk scoring models.
- Defining measurable success criteria for fixes, such as reduction in error budget consumption or alert frequency.
- Implementing automated verification tests that validate fixes in staging environments before production rollout.
- Tracking remediation completion through integration with ticketing systems and sprint planning tools.
- Conducting follow-up reviews 30–60 days after implementation to assess long-term effectiveness.
- Updating monitoring dashboards and alerting rules to reflect new failure modes post-remediation.
- Adjusting error budget policies based on updated reliability targets after architectural changes.
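The risk-scoring idea above can be sketched with a simple multiplicative model, recurrence likelihood times impact. The weights, task names, and scales below are illustrative assumptions; real models often add factors such as remediation cost.

```python
def risk_score(task: dict) -> float:
    """Multiplicative risk model: recurrence likelihood (0-1) x impact (1-10)."""
    return task["likelihood"] * task["impact"]


# Hypothetical remediation backlog from a recent RCA
tasks = [
    {"name": "add db connection pool limits", "likelihood": 0.7, "impact": 8},
    {"name": "fix flaky health check", "likelihood": 0.9, "impact": 3},
    {"name": "harden retry policy", "likelihood": 0.5, "impact": 9},
]

prioritized = sorted(tasks, key=risk_score, reverse=True)
```

Even this crude model surfaces a useful insight: the most likely recurrence (the flaky health check) ranks last because its impact is low, which is exactly the prioritization mistake gut-feel triage tends to make in the other direction.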
Module 9: Organizational Learning and Feedback Loops
- Standardizing incident report templates to ensure consistent data capture across teams and time.
- Integrating RCA findings into architecture review boards to influence future system design decisions.
- Creating targeted reliability training modules based on recurring root causes across incidents.
- Generating executive summaries that translate technical findings into business risk and cost implications.
- Using trend analysis to identify systemic issues, such as recurring configuration errors or testing gaps.
- Establishing feedback mechanisms from SRE teams to product managers for reliability requirement refinement.
- Archiving RCA artifacts in searchable knowledge bases with metadata tagging for future reference.
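Trend analysis over archived RCAs reduces, at its simplest, to counting root-cause tags across reports. A minimal sketch, assuming each archived report carries the metadata tags described above (the tag vocabulary here is hypothetical):

```python
from collections import Counter


def recurring_causes(reports, min_count: int = 2):
    """Count root-cause tags across incident reports and return those that
    recur at least `min_count` times, most frequent first."""
    counts = Counter(tag for report in reports for tag in report["tags"])
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Hypothetical archived RCA reports with metadata tags
reports = [
    {"id": "INC-101", "tags": ["config-error", "missing-test"]},
    {"id": "INC-115", "tags": ["config-error"]},
    {"id": "INC-120", "tags": ["dependency-timeout", "config-error"]},
]

systemic = recurring_causes(reports)  # [("config-error", 3)]
```

Counts like these are what turn individual postmortems into organizational signal: a tag that recurs across unrelated incidents is a candidate for the architecture review board or a targeted training module.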