This curriculum spans a multi-workshop research program embedded within a live DevOps environment, covering the technical, organizational, and ethical dimensions of conducting systematic inquiry across CI/CD pipelines, production systems, and engineering teams.
Module 1: Defining Research Objectives Aligned with DevOps Outcomes
- Selecting measurable KPIs such as deployment frequency and mean time to recovery to frame research questions that reflect actual system performance.
- Determining whether to conduct exploratory research (e.g., identifying bottlenecks in CI/CD pipelines) or confirmatory research (e.g., validating the impact of automated testing on release stability).
- Deciding between internal research using telemetry data versus external benchmarking against industry standards like DORA metrics.
- Negotiating access to production system logs and monitoring tools while adhering to data governance policies and privacy regulations.
- Establishing boundaries for research scope when multiple teams share infrastructure, ensuring findings are attributable and actionable.
- Documenting assumptions about toolchain behavior (e.g., Jenkins pipeline execution times) that may influence hypothesis validity.
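The KPIs named above can be computed directly from deployment and incident records. A minimal sketch follows; the record structures and field names (`deployed_at`, `opened_at`, `resolved_at`) are illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime

# Hypothetical deployment and incident records over a 7-day window.
deployments = [
    {"deployed_at": datetime(2024, 5, 1, 9, 0)},
    {"deployed_at": datetime(2024, 5, 2, 14, 30)},
    {"deployed_at": datetime(2024, 5, 4, 11, 15)},
]
incidents = [
    {"opened_at": datetime(2024, 5, 2, 15, 0), "resolved_at": datetime(2024, 5, 2, 16, 0)},
    {"opened_at": datetime(2024, 5, 4, 12, 0), "resolved_at": datetime(2024, 5, 4, 12, 30)},
]

def deployment_frequency(deployments, window_days):
    """Deployments per day over the observation window."""
    return len(deployments) / window_days

def mean_time_to_recovery(incidents):
    """Mean restore time in minutes across resolved incidents."""
    durations = [
        (i["resolved_at"] - i["opened_at"]).total_seconds() / 60
        for i in incidents
    ]
    return sum(durations) / len(durations)

freq = deployment_frequency(deployments, window_days=7)   # 3/7 per day
mttr = mean_time_to_recovery(incidents)                   # (60 + 30) / 2 = 45.0
```

Framing research questions in these units ("does change X raise deployment frequency without raising MTTR?") keeps hypotheses tied to observable system behavior.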
Module 2: Instrumentation and Data Collection in Production Environments
- Configuring observability tools (e.g., Prometheus, OpenTelemetry) to capture granular timing data from CI/CD stages without introducing performance overhead.
- Designing log schemas that standardize event tagging across microservices to enable cross-system analysis.
- Implementing sampling strategies for high-volume events (e.g., build triggers) to balance data completeness with storage costs.
- Integrating feature flags with telemetry to isolate and measure the impact of specific code changes on deployment reliability.
- Handling personally identifiable information (PII) in logs by applying masking or tokenization before ingestion into analytics platforms.
- Validating timestamp synchronization across distributed systems to ensure accurate sequence reconstruction during incident analysis.
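The PII-masking bullet can be sketched as a pre-ingestion transform. The regex covers only email addresses here, and the salt value is illustrative; a production pipeline would need a broader pattern set and a managed secret:

```python
import hashlib
import re

# Minimal PII pattern; real deployments need more (names, IPs, tokens, etc.).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def tokenize(value, salt):
    """Replace a PII value with a stable, non-reversible token so that
    records about the same user remain joinable after masking."""
    return "tok_" + hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def mask_log_line(line, salt="research-salt"):  # salt is a placeholder
    return EMAIL_RE.sub(lambda m: tokenize(m.group(0), salt), line)

line = "build 4821 triggered by alice@example.com on agent-7"
masked = mask_log_line(line)
```

Because the token is deterministic for a given salt, cross-event joins survive masking, which is the property that distinguishes tokenization from plain redaction.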
Module 3: Experimental Design in Continuous Delivery Pipelines
- Structuring A/B tests to compare deployment strategies (e.g., blue-green vs. canary) using rollback rate as a primary outcome metric.
- Randomizing build agent assignment in CI environments to eliminate hardware bias in performance measurements.
- Defining control and treatment groups when testing new linter rules, ensuring codebase homogeneity across samples.
- Calculating minimum detectable effect sizes for pipeline duration improvements to avoid underpowered experiments.
- Coordinating experiment windows with release schedules to prevent interference from concurrent changes.
- Implementing circuit breakers in experimental monitoring jobs to halt data collection if system load exceeds thresholds.
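The minimum-detectable-effect bullet reduces to a standard two-sample power calculation. A sketch under the usual normal-approximation assumptions (equal group variances, two-sided test); the example numbers are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.8):
    """Builds needed per group to detect an absolute difference of `mde`
    in mean pipeline duration, given a standard deviation of `sigma`
    (same units), using the two-sample normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_beta = z.inv_cdf(power)           # quantile for desired power
    n = 2 * ((z_alpha + z_beta) * sigma / mde) ** 2
    return math.ceil(n)

# E.g.: duration stddev ~90 s, want to detect a 30 s improvement.
n = sample_size_per_group(sigma=90, mde=30)
```

Running this before an experiment window is what prevents the underpowered comparisons the bullet warns about: if `n` exceeds the builds available in the window, the MDE must be relaxed or the window extended.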
Module 4: Analyzing Feedback Loops in Development Workflows
- Mapping feedback latency from test failure notifications to developer response actions using ticketing system timestamps.
- Correlating code review duration with post-merge defect rates to assess quality gate effectiveness.
- Identifying feedback desensitization patterns where teams ignore repeated static analysis warnings.
- Quantifying the impact of pipeline flakiness on developer trust by tracking manual override frequency.
- Segmenting feedback loop analysis by team size and domain complexity to uncover context-specific bottlenecks.
- Using survival analysis to model time-to-resolution for failed builds across different error types.
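The survival-analysis bullet can be made concrete with a Kaplan-Meier estimator, which handles builds still unresolved at the end of observation (right-censored) correctly. A self-contained sketch with illustrative data:

```python
def kaplan_meier(durations, resolved):
    """Kaplan-Meier survival curve: probability a failed build is still
    unresolved after t minutes. resolved[i] is False for builds still
    open when observation ended (right-censored)."""
    events = sorted(zip(durations, resolved))
    at_risk = len(events)
    survival, curve, i = 1.0, [], 0
    while i < len(events):
        t, deaths, n_at_t = events[i][0], 0, 0
        # Group all builds tied at time t.
        while i < len(events) and events[i][0] == t:
            n_at_t += 1
            deaths += events[i][1]  # censored builds are not "deaths"
            i += 1
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
    return curve

# Four failed builds: resolved at 10, 20, and 30 min; one still open at 20.
curve = kaplan_meier([10, 20, 20, 30], [True, True, False, True])
```

Fitting separate curves per error type (compile failure vs. flaky test vs. infrastructure) surfaces which failure classes drag out time-to-resolution.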
Module 5: Integrating Human Factors into System Performance Research
- Conducting structured post-incident interviews to extract cognitive factors influencing on-call decision-making.
- Correlating team on-call rotation schedules with incident recurrence rates to assess fatigue effects.
- Measuring toolchain usability through task completion rates during simulated deployment scenarios.
- Analyzing ChatOps message patterns to identify communication breakdowns during incident response.
- Mapping role-based access patterns to change approval delays in governance workflows.
- Evaluating the impact of documentation discoverability on mean time to recovery using search log analysis.
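The fatigue-effects bullet is, at its simplest, a correlation between rotation load and recurrence. A sketch using Pearson's r; the per-team figures are invented for illustration and carry no empirical weight:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical per-team data: consecutive on-call days in a rotation
# vs. repeat incidents in the following sprint.
oncall_days = [3, 5, 7, 7, 10, 14]
recurrences = [1, 1, 2, 3, 4, 6]
r = pearson(oncall_days, recurrences)
```

A strong positive r would motivate a follow-up controlled comparison; correlation alone cannot separate fatigue from confounders such as team size or service criticality, which is why the module pairs quantitative signals with structured interviews.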
Module 6: Governance and Ethics in Operational Research
- Obtaining informed consent from engineering staff when collecting behavioral data from IDE plugins or time-tracking tools.
- Establishing data retention policies for research datasets containing build credentials or access patterns.
- Defining anonymization protocols for publishing internal findings externally while preserving data utility.
- Creating audit trails for research queries that access production monitoring systems to meet compliance requirements.
- Reconciling research data ownership between platform teams and product engineering units.
- Implementing access controls to prevent researchers from inadvertently triggering operational actions via monitoring interfaces.
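The audit-trail bullet can be sketched as a hash-chained log: each research-query record embeds the hash of its predecessor, so any after-the-fact edit is detectable. The record fields (`user`, `query`) are illustrative:

```python
import hashlib
import json

def _entry_hash(entry, prev_hash):
    payload = json.dumps({"entry": entry, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_audit_entry(trail, entry):
    """Append a research-query record; each record chains to the previous
    one by hash, making the trail tamper-evident."""
    prev_hash = trail[-1]["hash"] if trail else "0" * 64
    trail.append({"entry": entry, "prev": prev_hash,
                  "hash": _entry_hash(entry, prev_hash)})

def verify_trail(trail):
    """Recompute the chain; any modified or reordered record breaks it."""
    prev = "0" * 64
    for record in trail:
        if record["prev"] != prev or record["hash"] != _entry_hash(record["entry"], prev):
            return False
        prev = record["hash"]
    return True

trail = []
append_audit_entry(trail, {"user": "researcher-1", "query": "avg build time"})
append_audit_entry(trail, {"user": "researcher-2", "query": "p95 deploy latency"})
ok_before = verify_trail(trail)
trail[0]["entry"]["query"] = "edited after the fact"  # simulated tampering
ok_after = verify_trail(trail)
```

A compliance reviewer can then verify the whole trail without trusting the researchers who wrote it, which is the property audit requirements usually demand.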
Module 7: Translating Research Insights into Process Improvements
- Prioritizing remediation efforts based on root cause analysis of recurring pipeline failures.
- Redesigning alert thresholds using statistical process control methods derived from historical incident data.
- Iterating on onboarding checklists using drop-off rates from new developer setup logs.
- Refactoring deployment scripts to eliminate anti-patterns identified through code churn analysis.
- Adjusting capacity planning models based on empirical build resource utilization trends.
- Updating incident review templates to include data-driven prompts that guide teams toward systemic fixes.
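The alert-threshold bullet can be sketched with individuals-chart control limits: historical measurements from incident-free operation define "normal," and alerts fire only beyond mean ± k·sigma. The latency figures are illustrative:

```python
from statistics import fmean, stdev

def control_limits(samples, k=3.0):
    """Statistical-process-control limits from historical baseline data:
    values outside mean +/- k * stddev are treated as out of control."""
    mu = fmean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma

# Hypothetical p50 deploy latencies (ms) from incident-free weeks.
latencies_ms = [120, 118, 125, 130, 122, 119, 127, 124, 121, 126]
lower, upper = control_limits(latencies_ms)

def out_of_control(x):
    return x < lower or x > upper
```

Deriving thresholds this way, rather than from hand-picked round numbers, ties alert sensitivity to the system's actual historical variance and reduces both false pages and desensitization.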