This curriculum matches the depth and structure of a multi-workshop incident management upskilling program. It covers detection, triage, forensic analysis, and systemic prevention across distributed systems, comparable to internal SRE capability-building programs in high-velocity technology organisations.
Module 1: Incident Detection and Alerting Infrastructure
- Configure threshold-based monitoring rules in Prometheus to distinguish between transient spikes and sustained service degradation without generating alert fatigue.
- Integrate application-level health checks into Kubernetes liveness and readiness probes to prevent traffic routing to unstable pods during crash recovery.
- Design alert routing in PagerDuty to escalate software crash alerts based on service criticality and on-call rotation schedules.
- Implement structured logging in applications using OpenTelemetry to ensure crash-related log entries include trace IDs and stack traces for root cause analysis.
- Tune the sensitivity of anomaly detection algorithms in Datadog to reduce false positives while still detecting the subtle memory-leak patterns that precede crashes.
- Deploy synthetic transaction monitors to simulate user workflows and detect functional outages that may not trigger infrastructure-level alerts.
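The transient-spike versus sustained-degradation distinction in the first bullet mirrors the semantics of a Prometheus `for:` clause: an alert fires only when the condition holds across a sustained window. A minimal sketch of that logic (thresholds and sample values are illustrative, not from any real configuration):

```python
# Sketch of "for:"-style sustained-threshold alerting: a metric must
# breach the threshold for `min_consecutive` consecutive samples before
# an alert fires, so isolated transient spikes are ignored.
def should_alert(samples, threshold, min_consecutive):
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the spike was transient; reset the window
    return False

# A one-sample spike does not fire; sustained degradation does.
spike = [0.01, 0.90, 0.02, 0.01]
sustained = [0.01, 0.60, 0.70, 0.80]
```

In a real Prometheus rule the same effect comes from `for: 5m` on the alert definition; the sketch only shows why the streak reset is what suppresses alert fatigue.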
Module 2: Crash Triage and Initial Response Protocols
- Define runbook procedures for initial triage that mandate collection of core dump files, log snippets, and recent deployment metadata before system restarts.
- Establish criteria for declaring a Sev-1 incident based on user impact metrics such as error rate, transaction failure volume, and geographic scope.
- Assign role-based permissions in incident response tools (e.g., FireHydrant) to ensure only authorized personnel can trigger rollback or failover actions.
- Initiate parallel investigation tracks: one for immediate mitigation, another for preserving forensic data for later analysis.
- Document all command-line interventions in a shared incident timeline to maintain auditability and prevent conflicting actions.
- Decide whether to drain or isolate a crashing node based on risk of cascading failures versus need for diagnostic access.
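The Sev-1 criteria in the second bullet above can be encoded as an explicit policy function so declarations are reproducible rather than judgment calls under pressure. The thresholds below are hypothetical placeholders, not recommendations:

```python
# Hypothetical severity policy: declare Sev-1 when any single signal is
# extreme, or when a moderate error rate combines with wide geographic
# scope. Every threshold here is a placeholder for illustration.
def is_sev1(error_rate, failed_transactions, affected_regions):
    if error_rate >= 0.50:                # half of all requests failing
        return True
    if failed_transactions >= 10_000:     # large absolute blast radius
        return True
    if error_rate >= 0.10 and affected_regions >= 2:
        return True                       # moderate but widespread
    return False
```

Encoding the policy also makes it reviewable in post-mortems: if an incident should have been Sev-1 but was not declared, the gap is in a function, not in someone's memory.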
Module 3: Forensic Data Collection and Preservation
- Configure the JVM to generate heap dumps on OutOfMemoryError (e.g., via -XX:+HeapDumpOnOutOfMemoryError) and ensure sufficient disk space and file permissions are available in production environments.
- Implement automatic crash dump collection for native applications using Windows Error Reporting (WER) or Linux abrt with secure upload to centralized storage.
- Enforce retention policies for diagnostic artifacts based on data sensitivity and compliance requirements (e.g., GDPR, HIPAA).
- Use eBPF probes to capture system call sequences leading up to a crash while keeping steady-state runtime overhead negligible.
- Validate integrity of collected memory dumps using checksums before transferring across network boundaries.
- Restrict access to crash artifacts using short-lived credentials and audit all access attempts in SIEM systems.
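The integrity-check bullet above reduces to hashing the dump at collection time and again after transfer. A sketch using only the standard library; the chunked read matters because memory dumps are often larger than available RAM:

```python
import hashlib

# Compute a SHA-256 digest of a dump file in fixed-size chunks so large
# memory dumps never have to fit in memory at once.
def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Verify a received copy against the original before analysis begins.
def verify_transfer(src_path, dst_path):
    return sha256_of(src_path) == sha256_of(dst_path)
```

In practice the source digest would be recorded in the incident timeline at collection time, so the receiving side can verify without access to the original host.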
Module 4: Root Cause Analysis Methodologies
- Apply the 5 Whys technique to a recent software crash, ensuring each "why" is supported by empirical evidence from logs or dumps.
- Correlate timestamps of crash events with recent deployment windows to determine if a new code release introduced the failure.
- Use flame graphs generated from CPU profiling data to identify functions consuming disproportionate resources prior to the crash.
- Reproduce crash conditions in a staging environment by replaying captured network traffic using tools like tcpreplay.
- Validate hypotheses about race conditions by running instrumented builds under stress testing frameworks such as JMeter or k6.
- Differentiate between memory corruption due to application bugs versus faulty hardware using memory testing tools like memtest86.
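Correlating crash timestamps with deployment windows (second bullet above) reduces to an interval check: did the crash fall within a deployment's "bake" window? A sketch using stdlib datetimes, with all service names, times, and window lengths illustrative:

```python
from datetime import datetime, timedelta

# Return deployments whose bake window (deploy time plus a grace period)
# contains the crash timestamp -- candidate triggers for the failure.
def suspect_deploys(crash_at, deploys, bake=timedelta(hours=2)):
    return [name for name, deployed_at in deploys
            if deployed_at <= crash_at <= deployed_at + bake]

# Hypothetical deployment log entries.
deploys = [
    ("checkout-v41", datetime(2024, 5, 1, 9, 0)),
    ("search-v17",   datetime(2024, 5, 1, 13, 30)),
]
```

A narrow bake window biases toward deployment-triggered causes; widening it catches slow failure modes such as memory leaks, at the cost of more candidates to rule out.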
Module 5: Mitigation and Recovery Strategies
- Implement circuit breaker patterns in service mesh configurations (e.g., Istio) to prevent cascading failures during partial outages.
- Execute controlled rollbacks using GitOps workflows in ArgoCD, ensuring configuration drift is captured and reviewed.
- Scale out stateless services to absorb load while diagnosing crashes in stateful components with limited horizontal scalability.
- Apply hotfixes via blue-green deployment to minimize downtime while maintaining the ability to revert quickly.
- Restore service by replaying message queues from durable storage after fixing a consumer crash loop.
- Temporarily disable non-critical features via feature flags to stabilize the system while preserving core functionality.
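The circuit breaker bullet above describes behaviour that Istio enforces at the mesh layer via outlier detection and connection-pool limits; the core state machine is small enough to sketch directly. This is a deliberately minimal version (the failure threshold is illustrative, and the half-open recovery state real breakers use is omitted for brevity):

```python
# Minimal circuit breaker: after `max_failures` consecutive failures the
# circuit opens and further calls fail fast, so a struggling dependency
# is not buried under additional load during a partial outage.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success closes the failure window
        return result
```

A production breaker would also transition to half-open after a cooldown and probe the dependency before fully closing; the sketch shows only the fail-fast core.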
Module 6: Post-Incident Review and Process Improvement
- Conduct blameless post-mortems with mandatory participation from engineering, SRE, and product stakeholders to capture systemic factors.
- Classify contributing factors using a taxonomy such as TIM (Timeline, Impact, Mechanism) to standardize future analysis.
- Track remediation tasks in Jira with dependencies, owners, and deadlines, integrating with the incident management platform via its API.
- Measure MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detect) across incidents to identify bottlenecks in response workflows.
- Update monitoring dashboards to reflect new failure modes discovered during recent incidents.
- Revise service-level objectives (SLOs) based on observed reliability patterns and user tolerance for downtime.
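MTTD and MTTR (fourth bullet above) are means over per-incident time deltas. Definitions vary by organisation; the sketch below uses one common convention, measuring MTTD from fault start to detection and MTTR from fault start to recovery, over hypothetical incident records:

```python
from datetime import datetime

# One convention: MTTD averages (detected - started), MTTR averages
# (recovered - started). Both are returned in minutes.
def mttd_mttr_minutes(incidents):
    detect = [(i["detected"] - i["started"]).total_seconds()
              for i in incidents]
    recover = [(i["recovered"] - i["started"]).total_seconds()
               for i in incidents]
    n = len(incidents)
    return sum(detect) / n / 60, sum(recover) / n / 60

# Hypothetical incident records for illustration.
incidents = [
    {"started": datetime(2024, 5, 1, 10, 0),
     "detected": datetime(2024, 5, 1, 10, 5),
     "recovered": datetime(2024, 5, 1, 10, 45)},
    {"started": datetime(2024, 5, 2, 14, 0),
     "detected": datetime(2024, 5, 2, 14, 15),
     "recovered": datetime(2024, 5, 2, 15, 0)},
]
```

Whatever convention is chosen, it should be applied consistently across incidents, or the trend lines these metrics are meant to expose become meaningless.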
Module 7: Resilience Engineering and Crash Prevention
- Introduce chaos engineering experiments using Gremlin to simulate process crashes and validate recovery automation.
- Enforce compile-time checks for null references and array bounds in critical code paths using static analysis tools.
- Implement automated canary analysis using Kayenta to detect performance regressions before full rollout.
- Require crash injection testing in CI pipelines for components handling untrusted input or external APIs.
- Design retry logic with exponential backoff and jitter to prevent thundering herd problems after service restoration.
- Conduct architecture review boards (ARBs) to evaluate new services for single points of failure and crash propagation risks.
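The backoff-with-jitter bullet above can be sketched as the widely used "full jitter" variant: the k-th retry waits a random amount between zero and an exponentially growing, capped ceiling, so clients do not retry in lockstep after a restoration. The base delay and cap below are placeholders:

```python
import random

# Full-jitter exponential backoff: retry k waits a uniform random time
# in [0, min(cap, base * 2**k)]. The randomness spreads retries out and
# prevents a synchronized thundering herd against a recovering service.
def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

The `rng` parameter is injectable purely so the function is testable; callers would use the default. Without the jitter (returning `ceiling` directly), every client restored at the same moment retries at the same moment, which is exactly the herd the bullet warns about.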