This curriculum spans a multi-workshop research program embedded within a live DevOps environment, covering the technical, organizational, and ethical dimensions of conducting systematic inquiry across CI/CD pipelines, production systems, and engineering teams.
Module 1: Defining Research Objectives Aligned with DevOps Outcomes
- Selecting measurable KPIs such as deployment frequency and mean time to recovery to frame research questions that reflect actual system performance.
- Determining whether to conduct exploratory research (e.g., identifying bottlenecks in CI/CD pipelines) or confirmatory research (e.g., validating the impact of automated testing on release stability).
- Deciding between internal research using telemetry data versus external benchmarking against industry standards like DORA metrics.
- Negotiating access to production system logs and monitoring tools while adhering to data governance policies and privacy regulations.
- Establishing boundaries for research scope when multiple teams share infrastructure, ensuring findings are attributable and actionable.
- Documenting assumptions about toolchain behavior (e.g., Jenkins pipeline execution times) that may influence hypothesis validity.
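The KPIs named above can be computed directly from deployment and incident records. A minimal sketch follows; the record structures and field names (`deployed_at`, `opened_at`, `resolved_at`) are illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime

# Hypothetical deployment and incident records over a 7-day window.
deployments = [
    {"deployed_at": datetime(2024, 5, 1, 9, 0)},
    {"deployed_at": datetime(2024, 5, 2, 14, 30)},
    {"deployed_at": datetime(2024, 5, 4, 11, 15)},
]
incidents = [
    {"opened_at": datetime(2024, 5, 2, 15, 0), "resolved_at": datetime(2024, 5, 2, 16, 0)},
    {"opened_at": datetime(2024, 5, 4, 12, 0), "resolved_at": datetime(2024, 5, 4, 12, 30)},
]

def deployment_frequency(deployments, window_days):
    """Deployments per day over the observation window."""
    return len(deployments) / window_days

def mean_time_to_recovery(incidents):
    """Mean restore time in minutes across resolved incidents."""
    durations = [
        (i["resolved_at"] - i["opened_at"]).total_seconds() / 60
        for i in incidents
    ]
    return sum(durations) / len(durations)

freq = deployment_frequency(deployments, window_days=7)   # 3/7 per day
mttr = mean_time_to_recovery(incidents)                   # (60 + 30) / 2 = 45.0
```

Framing research questions in these units ("does change X raise deployment frequency without raising MTTR?") keeps hypotheses tied to observable system behavior.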
Module 2: Instrumentation and Data Collection in Production Environments
- Configuring observability tools (e.g., Prometheus, OpenTelemetry) to capture granular timing data from CI/CD stages without introducing performance overhead.
- Designing log schemas that standardize event tagging across microservices to enable cross-system analysis.
- Implementing sampling strategies for high-volume events (e.g., build triggers) to balance data completeness with storage costs.
- Integrating feature flags with telemetry to isolate and measure the impact of specific code changes on deployment reliability.
- Handling personally identifiable information (PII) in logs by applying masking or tokenization before ingestion into analytics platforms.
- Validating timestamp synchronization across distributed systems to ensure accurate sequence reconstruction during incident analysis.
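The PII-masking bullet can be sketched as a pre-ingestion transform. The regex covers only email addresses here, and the salt value is illustrative; a production pipeline would need a broader pattern set and a managed secret:

```python
import hashlib
import re

# Minimal PII pattern; real deployments need more (names, IPs, tokens, etc.).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def tokenize(value, salt):
    """Replace a PII value with a stable, non-reversible token so that
    records about the same user remain joinable after masking."""
    return "tok_" + hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def mask_log_line(line, salt="research-salt"):  # salt is a placeholder
    return EMAIL_RE.sub(lambda m: tokenize(m.group(0), salt), line)

line = "build 4821 triggered by alice@example.com on agent-7"
masked = mask_log_line(line)
```

Because the token is deterministic for a given salt, cross-event joins survive masking, which is the property that distinguishes tokenization from plain redaction.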
Module 3: Experimental Design in Continuous Delivery Pipelines
- Structuring A/B tests to compare deployment strategies (e.g., blue-green vs. canary) using rollback rate as a primary outcome metric.
- Randomizing build agent assignment in CI environments to eliminate hardware bias in performance measurements.
- Defining control and treatment groups when testing new linter rules, ensuring codebase homogeneity across samples.
- Calculating minimum detectable effect sizes for pipeline duration improvements to avoid underpowered experiments.
- Coordinating experiment windows with release schedules to prevent interference from concurrent changes.
- Implementing circuit breakers in experimental monitoring jobs to halt data collection if system load exceeds thresholds.
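The minimum-detectable-effect bullet reduces to a standard two-sample power calculation. A sketch under the usual normal-approximation assumptions (equal group variances, two-sided test); the example numbers are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.8):
    """Builds needed per group to detect an absolute difference of `mde`
    in mean pipeline duration, given a standard deviation of `sigma`
    (same units), using the two-sample normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_beta = z.inv_cdf(power)           # quantile for desired power
    n = 2 * ((z_alpha + z_beta) * sigma / mde) ** 2
    return math.ceil(n)

# E.g.: duration stddev ~90 s, want to detect a 30 s improvement.
n = sample_size_per_group(sigma=90, mde=30)
```

Running this before an experiment window is what prevents the underpowered comparisons the bullet warns about: if `n` exceeds the builds available in the window, the MDE must be relaxed or the window extended.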
Module 4: Analyzing Feedback Loops in Development Workflows
- Mapping feedback latency from test failure notifications to developer response actions using ticketing system timestamps.
- Correlating code review duration with post-merge defect rates to assess quality gate effectiveness.
- Identifying feedback desensitization patterns where teams ignore repeated static analysis warnings.
- Quantifying the impact of pipeline flakiness on developer trust by tracking manual override frequency.
- Segmenting feedback loop analysis by team size and domain complexity to uncover context-specific bottlenecks.
- Using survival analysis to model time-to-resolution for failed builds across different error types.
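The survival-analysis bullet can be made concrete with a Kaplan-Meier estimator, which handles builds still unresolved at the end of observation (right-censored) correctly. A self-contained sketch with illustrative data:

```python
def kaplan_meier(durations, resolved):
    """Kaplan-Meier survival curve: probability a failed build is still
    unresolved after t minutes. resolved[i] is False for builds still
    open when observation ended (right-censored)."""
    events = sorted(zip(durations, resolved))
    at_risk = len(events)
    survival, curve, i = 1.0, [], 0
    while i < len(events):
        t, deaths, n_at_t = events[i][0], 0, 0
        # Group all builds tied at time t.
        while i < len(events) and events[i][0] == t:
            n_at_t += 1
            deaths += events[i][1]  # censored builds are not "deaths"
            i += 1
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
    return curve

# Four failed builds: resolved at 10, 20, and 30 min; one still open at 20.
curve = kaplan_meier([10, 20, 20, 30], [True, True, False, True])
```

Fitting separate curves per error type (compile failure vs. flaky test vs. infrastructure) surfaces which failure classes drag out time-to-resolution.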
Module 5: Integrating Human Factors into System Performance Research
- Conducting structured post-incident interviews to extract cognitive factors influencing on-call decision-making.
- Correlating team on-call rotation schedules with incident recurrence rates to assess fatigue effects.
- Measuring toolchain usability through task completion rates during simulated deployment scenarios.
- Analyzing ChatOps message patterns to identify communication breakdowns during incident response.
- Mapping role-based access patterns to change approval delays in governance workflows.
- Evaluating the impact of documentation discoverability on mean time to recovery using search log analysis.
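The fatigue-effects bullet is, at its simplest, a correlation between rotation load and recurrence. A sketch using Pearson's r; the per-team figures are invented for illustration and carry no empirical weight:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical per-team data: consecutive on-call days in a rotation
# vs. repeat incidents in the following sprint.
oncall_days = [3, 5, 7, 7, 10, 14]
recurrences = [1, 1, 2, 3, 4, 6]
r = pearson(oncall_days, recurrences)
```

A strong positive r would motivate a follow-up controlled comparison; correlation alone cannot separate fatigue from confounders such as team size or service criticality, which is why the module pairs quantitative signals with structured interviews.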
Module 6: Governance and Ethics in Operational Research
- Obtaining informed consent from engineering staff when collecting behavioral data from IDE plugins or time-tracking tools.
- Establishing data retention policies for research datasets containing build credentials or access patterns.
- Defining anonymization protocols for publishing internal findings externally while preserving data utility.
- Creating audit trails for research queries that access production monitoring systems to meet compliance requirements.
- Reconciling research data ownership between platform teams and product engineering units.
- Implementing access controls to prevent researchers from inadvertently triggering operational actions via monitoring interfaces.
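The audit-trail bullet can be sketched as a hash-chained log: each research-query record embeds the hash of its predecessor, so any after-the-fact edit is detectable. The record fields (`user`, `query`) are illustrative:

```python
import hashlib
import json

def _entry_hash(entry, prev_hash):
    payload = json.dumps({"entry": entry, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_audit_entry(trail, entry):
    """Append a research-query record; each record chains to the previous
    one by hash, making the trail tamper-evident."""
    prev_hash = trail[-1]["hash"] if trail else "0" * 64
    trail.append({"entry": entry, "prev": prev_hash,
                  "hash": _entry_hash(entry, prev_hash)})

def verify_trail(trail):
    """Recompute the chain; any modified or reordered record breaks it."""
    prev = "0" * 64
    for record in trail:
        if record["prev"] != prev or record["hash"] != _entry_hash(record["entry"], prev):
            return False
        prev = record["hash"]
    return True

trail = []
append_audit_entry(trail, {"user": "researcher-1", "query": "avg build time"})
append_audit_entry(trail, {"user": "researcher-2", "query": "p95 deploy latency"})
ok_before = verify_trail(trail)
trail[0]["entry"]["query"] = "edited after the fact"  # simulated tampering
ok_after = verify_trail(trail)
```

A compliance reviewer can then verify the whole trail without trusting the researchers who wrote it, which is the property audit requirements usually demand.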
Module 7: Translating Research Insights into Process Improvements
- Prioritizing remediation efforts based on root cause analysis of recurring pipeline failures.
- Redesigning alert thresholds using statistical process control methods derived from historical incident data.
- Iterating on onboarding checklists using drop-off rates from new developer setup logs.
- Refactoring deployment scripts to eliminate anti-patterns identified through code churn analysis.
- Adjusting capacity planning models based on empirical build resource utilization trends.
- Updating incident review templates to include data-driven prompts that guide teams toward systemic fixes.
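The alert-threshold bullet can be sketched with individuals-chart control limits: historical measurements from incident-free operation define "normal," and alerts fire only beyond mean ± k·sigma. The latency figures are illustrative:

```python
from statistics import fmean, stdev

def control_limits(samples, k=3.0):
    """Statistical-process-control limits from historical baseline data:
    values outside mean +/- k * stddev are treated as out of control."""
    mu = fmean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma

# Hypothetical p50 deploy latencies (ms) from incident-free weeks.
latencies_ms = [120, 118, 125, 130, 122, 119, 127, 124, 121, 126]
lower, upper = control_limits(latencies_ms)

def out_of_control(x):
    return x < lower or x > upper
```

Deriving thresholds this way, rather than from hand-picked round numbers, ties alert sensitivity to the system's actual historical variance and reduces both false pages and desensitization.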