Name: Production workflow in DevOps
Price: 249 USD
Availability: InStock

Description

This curriculum spans the design and governance of production-grade DevOps workflows at the scale of multi-team platform engineering programs, covering the technical, procedural, and coordination challenges typical in regulated or large-scale software organisations.

Module 1: Infrastructure as Code (IaC) Design and Governance

Select between declarative (e.g., Terraform) and imperative (e.g., Ansible) IaC tools based on team expertise and change control requirements.
Implement module versioning in Terraform to prevent breaking changes across environments during parallel development.
Enforce IaC peer review policies using mandatory pull requests and automated policy checks via Open Policy Agent (OPA).
Balance state file security and accessibility by choosing between remote backends (e.g., S3 with state locking) and local state with access controls.
Design reusable IaC modules with input validation and output standardization to support multi-team adoption.
Integrate drift detection into CI/CD pipelines to identify and remediate configuration deviations from source-controlled templates.
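
As an illustration of the drift detection item above, the sketch below compares a source-controlled description of a resource with what is actually deployed. It is a minimal example, not a replacement for tool-native checks (Terraform's own `terraform plan -detailed-exitcode` exits with code 2 when changes are pending). The function name `detect_drift`, the `ABSENT` marker, and the sample attributes are assumptions made for this example.

```python
from typing import Any

# Marker used when an attribute exists on only one side (hypothetical convention).
ABSENT = "<absent>"


def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (desired_value, actual_value)} for every attribute that differs."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: the template says t3.small, someone resized the instance by hand
# and attached a public IP that the template never declared.
desired_state = {"instance_type": "t3.small", "encrypted": True}
actual_state = {"instance_type": "t3.large", "encrypted": True, "public_ip": "203.0.113.10"}

drift_report = detect_drift(desired_state, actual_state)
for attribute, (want, have) in drift_report.items():
    print(f"DRIFT {attribute}: desired={want!r} actual={have!r}")
```

A scheduled pipeline job could fail, or open a ticket, whenever the report is non-empty; whether to auto-remediate or only alert is a policy decision for each team.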

Module 2: CI/CD Pipeline Architecture and Optimization

Decide between monorepo and polyrepo pipeline designs based on team autonomy, release cadence, and dependency management needs.
Implement pipeline parallelization and selective job triggering to reduce feedback time in large codebases.
Configure artifact retention policies in Nexus or Artifactory to balance storage costs with audit and rollback requirements.
Enforce pipeline immutability by signing and versioning pipeline definitions in source control.
Integrate canary analysis into deployment stages using metrics from Prometheus and logs from Loki to gate progression.
Design pipeline rollback mechanisms that include artifact reversion, configuration reset, and database migration rollback coordination.
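
Selective job triggering, mentioned above, usually comes down to mapping changed file paths to the jobs that depend on them. Most CI systems offer this natively through path filters; the sketch below only shows the underlying idea. The job names, directory layout, and the `JOB_RULES` mapping are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping of CI jobs to the path patterns that should trigger them.
JOB_RULES = {
    "build-api": ["services/api/*", "libs/common/*"],
    "build-web": ["services/web/*", "libs/common/*"],
    "lint-docs": ["docs/*"],
}


def select_jobs(changed_files, rules=JOB_RULES):
    """Return the sorted names of jobs whose patterns match at least one changed file."""
    selected = set()
    for job, patterns in rules.items():
        for path in changed_files:
            # fnmatchcase's "*" also matches "/" characters, so "services/api/*"
            # covers files at any depth below that directory.
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                selected.add(job)
                break
    return sorted(selected)


print(select_jobs(["services/api/src/main.py"]))  # only the API build
print(select_jobs(["libs/common/util.py"]))       # shared library: both builds
```

A change to a shared library triggers every dependent job, while a change that matches no rule triggers nothing; teams often add a catch-all rule so that unmatched paths still run a baseline check.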

Module 3: Secure Software Supply Chain

Enforce SBOM (Software Bill of Materials) generation at build time and integrate into vulnerability scanning workflows.
Require signed commits and sign build artifacts using Sigstore or Notary so that unsigned or tampered code can be detected and rejected before deployment.
Configure dependency scanning tools (e.g., Dependabot, Snyk) with policy thresholds that align with risk tolerance and remediation capacity.
Isolate build environments using ephemeral runners with minimal privileges to reduce attack surface.
Integrate attestations into the pipeline using in-toto or Cosign to verify build provenance and integrity.
Define and enforce admission controls at the Kubernetes cluster using OPA Gatekeeper or Kyverno policies so that only container images from approved registries, with valid signatures, can be deployed.
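
The policy thresholds for dependency scanning mentioned in this module can be pictured as a simple gate over scanner findings. The sketch below assumes findings have already been exported as a list of records with a `severity` field; that input shape, the function name `evaluate_findings`, and the sample limits are assumptions for illustration and do not reflect the output format of Dependabot, Snyk, or any other scanner.

```python
from collections import Counter

SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def evaluate_findings(findings, max_allowed):
    """Return a list of human-readable policy violations (empty list means the gate passes).

    findings:    list of dicts, each with at least a "severity" key
    max_allowed: dict mapping severity -> maximum tolerated count;
                 severities missing from the dict are not limited
    """
    counts = Counter(finding["severity"] for finding in findings)
    violations = []
    for severity in SEVERITY_ORDER:
        limit = max_allowed.get(severity)
        if limit is not None and counts[severity] > limit:
            violations.append(f"{severity}: {counts[severity]} found, {limit} allowed")
    return violations


# Hypothetical findings and policy: no criticals tolerated, up to five highs.
sample_findings = [
    {"id": "VULN-1", "severity": "critical"},
    {"id": "VULN-2", "severity": "high"},
    {"id": "VULN-3", "severity": "high"},
]
sample_policy = {"critical": 0, "high": 5}

print(evaluate_findings(sample_findings, sample_policy))
```

Setting the limits too tight relative to remediation capacity tends to produce a permanently red pipeline that teams learn to bypass, which is why the thresholds are treated as a risk decision rather than a tooling default.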

Module 4: Observability and Telemetry Integration

Standardize log structure across services using structured logging formats (e.g., JSON) and enforce schema compliance.
Configure distributed tracing with context propagation across microservices using OpenTelemetry instrumentation.
Balance metric granularity and cardinality to prevent Prometheus series explosion while maintaining diagnostic utility.
Implement synthetic monitoring for critical user journeys to detect degradation before real-user impact.
Design alerting rules with actionable thresholds and clear runbook references to reduce mean time to resolution.
Aggregate and correlate telemetry data across logs, metrics, and traces in a centralized observability platform for root cause analysis.
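
The cardinality concern above is easy to quantify: in the worst case, one metric produces a time series for every combination of its label values. The sketch below computes that upper bound. The label names and value counts are invented for the example, and real series counts are often lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical HTTP request counter with three bounded labels.
bounded = {"method": 5, "status": 8, "endpoint": 40}
print(worst_case_series(bounded))  # 5 * 8 * 40 = 1600

# Adding an unbounded label such as a user ID multiplies the total.
with_user_id = {**bounded, "user_id": 10_000}
print(worst_case_series(with_user_id))  # 1600 * 10000 = 16000000
```

This is why identifiers such as user or request IDs generally belong in logs or traces rather than in metric labels.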

Module 5: Production Deployment Strategies

Select deployment strategy (blue-green, canary, rolling) based on risk profile, rollback requirements, and infrastructure constraints.
Coordinate database schema changes with application deployments using versioned migration scripts and backward compatibility.
Implement feature flags with kill switches to decouple deployment from release and enable controlled rollouts.
Configure traffic shifting in service mesh (e.g., Istio) or API gateway to support gradual canary promotions.
Design health check endpoints that reflect actual service dependencies and readiness for traffic routing.
Validate deployment success using automated smoke tests and performance benchmarks before full cutover.
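
The feature flag item above combines two mechanisms: a percentage rollout and a kill switch that overrides it. The sketch below shows one common way to do this with deterministic hashing, so that a given user stays in the same bucket as the percentage grows. The `FeatureFlag` class and its fields are hypothetical and not the API of any particular flag service.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    name: str
    rollout_percent: int = 0  # share of users who see the feature, 0..100
    killed: bool = False      # kill switch: overrides the rollout entirely

    def is_enabled(self, user_id: str) -> bool:
        if self.killed:
            return False
        # Hash flag name + user ID so each flag buckets users independently
        # and the same user always lands in the same bucket (0..99).
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent


flag = FeatureFlag("new-checkout", rollout_percent=10)
exposed = sum(flag.is_enabled(f"user-{i}") for i in range(1000))
print(f"{exposed} of 1000 simulated users see the feature at 10% rollout")

flag.killed = True  # incident: disable for everyone without redeploying
print(any(flag.is_enabled(f"user-{i}") for i in range(1000)))
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 10 to 50 keeps every already-exposed user exposed, which avoids users flipping between old and new behaviour during a rollout.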

Module 6: Incident Response and Postmortem Culture

Define incident severity levels with clear escalation paths and communication protocols for on-call teams.
Integrate incident management tools (e.g., PagerDuty, Opsgenie) with monitoring systems to automate alert routing.
Conduct blameless postmortems with structured templates that document timeline, contributing factors, and action items.
Track remediation tasks from postmortems in a shared backlog visible to all stakeholders to ensure accountability and follow-through.
Implement runbook automation for common incident scenarios to reduce cognitive load during outages.
Rotate on-call responsibilities with training and shadowing to maintain team resilience and knowledge sharing.

Module 7: Compliance, Auditing, and Change Management

Map CI/CD pipeline stages to controls in audit frameworks (e.g., SOC 2, ISO 27001) and generate evidence artifacts automatically.
Implement change advisory board (CAB) workflows for high-risk production changes using automated approval gates.
Log all pipeline executions and configuration changes to immutable storage for forensic analysis.
Enforce separation of duties by restricting production deployment permissions to designated roles and requiring dual approvals.
Integrate configuration management database (CMDB) updates into deployment pipelines to maintain accurate asset inventory.
Conduct periodic access reviews for pipeline and infrastructure permissions to enforce least privilege.
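
The separation-of-duties item above can be expressed as a small approval gate: a production deployment proceeds only when enough distinct, authorized people other than the change author have approved it. The sketch below is a minimal illustration; the function name, parameters, and user names are hypothetical, and in practice this rule would normally be enforced by the source control or deployment platform.

```python
def deployment_allowed(author, approvers, authorized_approvers, required=2):
    """Return True when at least `required` distinct authorized approvers,
    excluding the change author, have approved the deployment."""
    valid = {
        approver
        for approver in approvers
        if approver in authorized_approvers and approver != author
    }
    return len(valid) >= required


release_managers = {"alice", "bob", "carol"}  # hypothetical designated role

print(deployment_allowed("alice", ["bob", "carol"], release_managers))  # two independent approvals
print(deployment_allowed("alice", ["alice", "bob"], release_managers))  # self-approval does not count
print(deployment_allowed("alice", ["bob", "bob"], release_managers))    # duplicate approvals do not count
```

Building the set of valid approvers, rather than counting approvals, is what rules out both self-approval and the same person approving twice.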

Module 8: Scaling DevOps Across Multiple Teams and Environments

Design platform teams to provide self-service tooling (e.g., internal developer platforms) while maintaining security and compliance guardrails.
Standardize environment provisioning using environment templates to reduce configuration drift.
Implement multi-region deployment patterns with failover testing to meet disaster recovery objectives.
Manage cross-team dependencies using contract testing and consumer-driven contracts in integration pipelines.
Balance centralization and decentralization by defining clear ownership boundaries for shared services and tooling.
Measure and report on DORA metrics consistently across teams to identify bottlenecks and track improvement initiatives.
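
Reporting DORA metrics consistently, as described above, depends on every team computing them from the same definitions. The sketch below derives three of the four metrics (deployment frequency, lead time for changes, change failure rate) from a list of deployment records; time to restore service is omitted because it needs incident data. The `Deployment` record shape and the choice of the median for lead time are assumptions for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime       # when the change was committed
    deployed_at: datetime        # when it reached production
    caused_failure: bool = False # whether it led to a degradation needing remediation


def dora_summary(deployments, window_days):
    """Summarise deployments observed over a reporting window of `window_days` days."""
    if not deployments:
        return {
            "deployments_per_day": 0.0,
            "median_lead_time_hours": None,
            "change_failure_rate": None,
        }
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = sum(d.caused_failure for d in deployments)
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": failures / len(deployments),
    }


# Hypothetical data: four deployments in a two-day window, one of which caused a failure.
base = datetime(2024, 1, 1)
sample = [
    Deployment(base, base + timedelta(hours=h), caused_failure=(h == 8))
    for h in (2, 4, 6, 8)
]
summary = dora_summary(sample, window_days=2)
print(summary)
```

Agreeing on details such as where the lead-time clock starts (first commit, merge, or pull request creation) matters more for cross-team comparison than the arithmetic itself.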