This curriculum matches the depth and structure of a multi-workshop incident management upskilling program. It covers detection, triage, forensic analysis, and systemic prevention across distributed systems, comparable to internal SRE capability-building programs in high-velocity technology organisations.
Module 1: Incident Detection and Alerting Infrastructure
- Configure threshold-based monitoring rules in Prometheus to distinguish between transient spikes and sustained service degradation without generating alert fatigue.
- Integrate application-level health checks into Kubernetes liveness and readiness probes to prevent traffic routing to unstable pods during crash recovery.
- Design alert routing in PagerDuty to escalate software crash alerts based on service criticality and on-call rotation schedules.
- Implement structured logging in applications using OpenTelemetry to ensure crash-related log entries include trace IDs and stack traces for root cause analysis.
- Tune the sensitivity of anomaly detection algorithms in Datadog to reduce false positives while still detecting the subtle memory-leak patterns that precede crashes.
- Deploy synthetic transaction monitors to simulate user workflows and detect functional outages that may not trigger infrastructure-level alerts.
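The transient-spike versus sustained-degradation distinction in the first bullet mirrors the semantics of a Prometheus `for:` clause: an alert fires only when the condition holds across a sustained window. A minimal sketch of that logic (thresholds and sample values are illustrative, not from any real configuration):

```python
# Sketch of "for:"-style sustained-threshold alerting: a metric must
# breach the threshold for `min_consecutive` consecutive samples before
# an alert fires, so isolated transient spikes are ignored.
def should_alert(samples, threshold, min_consecutive):
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the spike was transient; reset the window
    return False

# A one-sample spike does not fire; sustained degradation does.
spike = [0.01, 0.90, 0.02, 0.01]
sustained = [0.01, 0.60, 0.70, 0.80]
```

In a real Prometheus rule the same effect comes from `for: 5m` on the alert definition; the sketch only shows why the streak reset is what suppresses alert fatigue.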
Module 2: Crash Triage and Initial Response Protocols
- Define runbook procedures for initial triage that mandate collection of core dump files, log snippets, and recent deployment metadata before system restarts.
- Establish criteria for declaring a Sev-1 incident based on user impact metrics such as error rate, transaction failure volume, and geographic scope.
- Assign role-based permissions in incident response tools (e.g., FireHydrant) to ensure only authorized personnel can trigger rollback or failover actions.
- Initiate parallel investigation tracks: one for immediate mitigation, another for preserving forensic data for later analysis.
- Document all command-line interventions in a shared incident timeline to maintain auditability and prevent conflicting actions.
- Decide whether to drain or isolate a crashing node based on risk of cascading failures versus need for diagnostic access.
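The Sev-1 criteria in the second bullet above can be encoded as an explicit policy function so declarations are reproducible rather than judgment calls under pressure. The thresholds below are hypothetical placeholders, not recommendations:

```python
# Hypothetical severity policy: declare Sev-1 when any single signal is
# extreme, or when a moderate error rate combines with wide geographic
# scope. Every threshold here is a placeholder for illustration.
def is_sev1(error_rate, failed_transactions, affected_regions):
    if error_rate >= 0.50:                # half of all requests failing
        return True
    if failed_transactions >= 10_000:     # large absolute blast radius
        return True
    if error_rate >= 0.10 and affected_regions >= 2:
        return True                       # moderate but widespread
    return False
```

Encoding the policy also makes it reviewable in post-mortems: if an incident should have been Sev-1 but was not declared, the gap is in a function, not in someone's memory.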
Module 3: Forensic Data Collection and Preservation
- Configure the JVM to generate heap dumps on OutOfMemoryError (e.g., via -XX:+HeapDumpOnOutOfMemoryError) and ensure sufficient disk space and file permissions are available in production environments.
- Implement automatic crash dump collection for native applications using Windows Error Reporting (WER) or Linux abrt with secure upload to centralized storage.
- Enforce retention policies for diagnostic artifacts based on data sensitivity and compliance requirements (e.g., GDPR, HIPAA).
- Use eBPF probes to capture system call sequences leading up to a crash while keeping steady-state runtime overhead negligible.
- Validate integrity of collected memory dumps using checksums before transferring across network boundaries.
- Restrict access to crash artifacts using short-lived credentials and audit all access attempts in SIEM systems.
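The integrity-check bullet above reduces to hashing the dump at collection time and again after transfer. A sketch using only the standard library; the chunked read matters because memory dumps are often larger than available RAM:

```python
import hashlib

# Compute a SHA-256 digest of a dump file in fixed-size chunks so large
# memory dumps never have to fit in memory at once.
def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Verify a received copy against the original before analysis begins.
def verify_transfer(src_path, dst_path):
    return sha256_of(src_path) == sha256_of(dst_path)
```

In practice the source digest would be recorded in the incident timeline at collection time, so the receiving side can verify without access to the original host.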
Module 4: Root Cause Analysis Methodologies
- Apply the 5 Whys technique to a recent software crash, ensuring each "why" is supported by empirical evidence from logs or dumps.
- Correlate timestamps of crash events with recent deployment windows to determine if a new code release introduced the failure.
- Use flame graphs generated from CPU profiling data to identify functions consuming disproportionate resources prior to the crash.
- Reproduce crash conditions in a staging environment by replaying captured network traffic using tools like tcpreplay.
- Validate hypotheses about race conditions by running instrumented builds under stress testing frameworks such as JMeter or k6.
- Differentiate between memory corruption due to application bugs versus faulty hardware using memory testing tools like memtest86.
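Correlating crash timestamps with deployment windows (second bullet above) reduces to an interval check: did the crash fall within a deployment's "bake" window? A sketch using stdlib datetimes, with all service names, times, and window lengths illustrative:

```python
from datetime import datetime, timedelta

# Return deployments whose bake window (deploy time plus a grace period)
# contains the crash timestamp -- candidate triggers for the failure.
def suspect_deploys(crash_at, deploys, bake=timedelta(hours=2)):
    return [name for name, deployed_at in deploys
            if deployed_at <= crash_at <= deployed_at + bake]

# Hypothetical deployment log entries.
deploys = [
    ("checkout-v41", datetime(2024, 5, 1, 9, 0)),
    ("search-v17",   datetime(2024, 5, 1, 13, 30)),
]
```

A narrow bake window biases toward deployment-triggered causes; widening it catches slow failure modes such as memory leaks, at the cost of more candidates to rule out.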
Module 5: Mitigation and Recovery Strategies
- Implement circuit breaker patterns in service mesh configurations (e.g., Istio) to prevent cascading failures during partial outages.
- Execute controlled rollbacks using GitOps workflows in ArgoCD, ensuring configuration drift is captured and reviewed.
- Scale out stateless services to absorb load while diagnosing crashes in stateful components with limited horizontal scalability.
- Apply hotfixes via blue-green deployment to minimize downtime while maintaining the ability to revert quickly.
- Restore service by replaying message queues from durable storage after fixing a consumer crash loop.
- Temporarily disable non-critical features via feature flags to stabilize the system while preserving core functionality.
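The circuit breaker bullet above describes behaviour that Istio enforces at the mesh layer via outlier detection and connection-pool limits; the core state machine is small enough to sketch directly. This is a deliberately minimal version (the failure threshold is illustrative, and the half-open recovery state real breakers use is omitted for brevity):

```python
# Minimal circuit breaker: after `max_failures` consecutive failures the
# circuit opens and further calls fail fast, so a struggling dependency
# is not buried under additional load during a partial outage.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success closes the failure window
        return result
```

A production breaker would also transition to half-open after a cooldown and probe the dependency before fully closing; the sketch shows only the fail-fast core.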
Module 6: Post-Incident Review and Process Improvement
- Conduct blameless post-mortems with mandatory participation from engineering, SRE, and product stakeholders to capture systemic factors.
- Classify contributing factors using a taxonomy such as TIM (Timeline, Impact, Mechanism) to standardize future analysis.
- Track remediation tasks in Jira with dependencies, owners, and deadlines, integrating with the incident management platform via its API.
- Measure MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detect) across incidents to identify bottlenecks in response workflows.
- Update monitoring dashboards to reflect new failure modes discovered during recent incidents.
- Revise service-level objectives (SLOs) based on observed reliability patterns and user tolerance for downtime.
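MTTD and MTTR (fourth bullet above) are means over per-incident time deltas. Definitions vary by organisation; the sketch below uses one common convention, measuring MTTD from fault start to detection and MTTR from fault start to recovery, over hypothetical incident records:

```python
from datetime import datetime

# One convention: MTTD averages (detected - started), MTTR averages
# (recovered - started). Both are returned in minutes.
def mttd_mttr_minutes(incidents):
    detect = [(i["detected"] - i["started"]).total_seconds()
              for i in incidents]
    recover = [(i["recovered"] - i["started"]).total_seconds()
               for i in incidents]
    n = len(incidents)
    return sum(detect) / n / 60, sum(recover) / n / 60

# Hypothetical incident records for illustration.
incidents = [
    {"started": datetime(2024, 5, 1, 10, 0),
     "detected": datetime(2024, 5, 1, 10, 5),
     "recovered": datetime(2024, 5, 1, 10, 45)},
    {"started": datetime(2024, 5, 2, 14, 0),
     "detected": datetime(2024, 5, 2, 14, 15),
     "recovered": datetime(2024, 5, 2, 15, 0)},
]
```

Whatever convention is chosen, it should be applied consistently across incidents, or the trend lines these metrics are meant to expose become meaningless.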
Module 7: Resilience Engineering and Crash Prevention
- Introduce chaos engineering experiments using Gremlin to simulate process crashes and validate recovery automation.
- Enforce compile-time checks for null references and array bounds in critical code paths using static analysis tools.
- Implement automated canary analysis using Kayenta to detect performance regressions before full rollout.
- Require crash injection testing in CI pipelines for components handling untrusted input or external APIs.
- Design retry logic with exponential backoff and jitter to prevent thundering herd problems after service restoration.
- Conduct architecture review boards (ARBs) to evaluate new services for single points of failure and crash propagation risks.
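The backoff-with-jitter bullet above can be sketched as the widely used "full jitter" variant: the k-th retry waits a random amount between zero and an exponentially growing, capped ceiling, so clients do not retry in lockstep after a restoration. The base delay and cap below are placeholders:

```python
import random

# Full-jitter exponential backoff: retry k waits a uniform random time
# in [0, min(cap, base * 2**k)]. The randomness spreads retries out and
# prevents a synchronized thundering herd against a recovering service.
def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

The `rng` parameter is injectable purely so the function is testable; callers would use the default. Without the jitter (returning `ceiling` directly), every client restored at the same moment retries at the same moment, which is exactly the herd the bullet warns about.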