
Infrastructure Maintenance in Availability Management

$299.00
How you learn:
Self-paced • Lifetime updates
Who trusts this:
Trusted by professionals in 160+ countries
When you get access:
Course access is prepared after purchase and delivered via email
Toolkit Included:
Includes a practical, ready-to-use toolkit of implementation templates, worksheets, checklists, and decision-support materials that accelerate real-world application and reduce setup time.
Your guarantee:
30-day money-back guarantee — no questions asked

This curriculum spans the technical and operational rigor of a multi-workshop availability engineering program, covering the same depth of planning, execution, and governance tasks seen in enterprise incident command systems and large-scale cloud infrastructure reviews.

Module 1: Defining Availability Requirements and SLIs

  • Selecting service-level indicators (SLIs) that reflect actual user-perceived availability, such as end-to-end request success rate rather than synthetic transaction latency.
  • Negotiating SLI measurement windows (e.g., 5-minute vs. 1-hour rollups) with product teams to balance sensitivity and operational noise.
  • Implementing blackbox probing for externally accessible endpoints while accounting for CDN and edge caching effects on availability signals.
  • Deciding whether to include partial degradation (e.g., degraded search functionality) as downtime in SLI calculations.
  • Calibrating error budget burn rate thresholds that trigger incident response without causing alert fatigue.
  • Documenting SLI calculation logic in code (e.g., Prometheus queries) to ensure auditability and consistency across teams.
  • Handling third-party dependency outages by defining whether they count against internal availability commitments.
  • Mapping business-critical user journeys to technical endpoints to prioritize monitoring coverage.
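To make the error-budget bullet above concrete, here is a minimal sketch of a multi-window burn-rate alert in Python. The 99.9% SLO, the 14.4x threshold, and the pairing of a short (e.g. 5-minute) with a long (e.g. 1-hour) window are illustrative assumptions, not values from the course.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Ratio of the observed error rate to the error budget the SLO allows.

    A burn rate of 1.0 consumes the budget exactly over the SLO period;
    anything above 1.0 consumes it faster.
    """
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget


def should_page(short_window_err: float, long_window_err: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Multi-window alert: page only when both the short and the long
    measurement window burn faster than the threshold, which filters
    out brief error spikes and reduces alert fatigue."""
    return (burn_rate(short_window_err, slo) >= threshold
            and burn_rate(long_window_err, slo) >= threshold)
```

Requiring both windows to breach is what trades sensitivity against operational noise, mirroring the SLI-window negotiation described above.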

Module 2: High-Availability Architecture Design

  • Choosing between active-passive and active-active failover models based on RTO and RPO requirements for stateful services.
  • Designing regional failover strategies that account for DNS TTL limitations and cloud provider load balancer propagation delays.
  • Implementing distributed consensus algorithms (e.g., Raft) for metadata coordination in multi-region control planes.
  • Selecting quorum configurations in clustered databases to balance consistency, availability, and fault tolerance.
  • Validating cross-AZ routing and failover behavior in virtual private clouds through controlled network partition testing.
  • Architecting state replication mechanisms (e.g., async vs. sync replication) for session stores under bandwidth and latency constraints.
  • Integrating circuit breakers at service mesh level to prevent cascading failures during partial outages.
  • Evaluating cost-performance trade-offs of multi-cloud vs. multi-region deployment for critical workloads.
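The quorum-configuration bullet above can be illustrated with the majority-quorum arithmetic used by Raft-style consensus. This is a generic sketch; replica counts are example values.

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of nodes that constitutes a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Nodes that can fail while a majority quorum stays reachable."""
    return replicas - majority_quorum(replicas)
```

Note that an even-sized cluster tolerates no more failures than the next-smaller odd size (4 nodes tolerate 1 failure, the same as 3), which is why odd replica counts are the usual choice when balancing consistency, availability, and cost.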

Module 3: Monitoring and Observability Implementation

  • Deploying redundant telemetry collectors to prevent monitoring blind spots during infrastructure failures.
  • Configuring multi-dimensional alerting rules that correlate metrics, logs, and traces to reduce false positives.
  • Setting up heartbeat monitoring for background job processors that do not serve HTTP traffic.
  • Instrumenting retry logic in clients to distinguish between transient and permanent failures in error rate calculations.
  • Storing and querying high-cardinality labels in time-series databases without degrading query performance.
  • Implementing log sampling strategies for high-volume services while preserving debuggability for rare errors.
  • Validating alert delivery paths through multiple channels (e.g., PagerDuty, SMS, backup email) during comms outages.
  • Using synthetic transactions to simulate user flows that are difficult to monitor via production traffic alone.
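The log-sampling bullet above can be sketched as a level-aware sampler: high-severity records always pass, while the high-volume informational stream is sampled. The 1% default rate and the severity names are assumptions for illustration.

```python
import random


def should_log(level: str, sample_rate: float = 0.01,
               rng=random.random) -> bool:
    """Always keep WARNING/ERROR/CRITICAL records so rare failures stay
    fully debuggable; sample the high-volume INFO/DEBUG stream at
    `sample_rate`. `rng` is injectable for deterministic testing."""
    if level in ("WARNING", "ERROR", "CRITICAL"):
        return True
    return rng() < sample_rate
```

Keeping every error record while down-sampling routine output preserves debuggability for rare failures at a fraction of the storage cost.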

Module 4: Incident Response and Failover Execution

  • Executing DNS-based failover with pre-warmed endpoints to minimize recovery time during regional outages.
  • Validating failover runbooks under degraded conditions, such as partial control plane access.
  • Coordinating incident command structure handoffs during extended outages exceeding 12 hours.
  • Blocking automated deployments during active incidents to prevent compounding failures.
  • Rotating credentials and certificates post-incident to close potential attack vectors exposed during failover.
  • Managing communication with external stakeholders using templated status updates without disclosing sensitive architecture details.
  • Executing data reconciliation jobs after failback to resolve inconsistencies from async replication.
  • Enforcing change freeze windows following major incidents to stabilize the environment.

Module 5: Disaster Recovery Planning and Testing

  • Scheduling quarterly DR drills that simulate complete data center loss, including backup power and network egress failure.
  • Validating backup integrity by restoring production databases to isolated environments and verifying checksums.
  • Measuring actual RTO by timing full application stack recovery from backups, including dependency chains.
  • Managing encryption key escrow and access controls for offline backups in air-gapped storage.
  • Documenting manual intervention steps required when automated recovery tools are unavailable.
  • Testing cross-region IAM role replication and policy synchronization during DR activation.
  • Updating DR plans after major architectural changes, such as migration to serverless components.
  • Coordinating DR tests with dependent teams to avoid cascading impact on shared services.
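The backup-integrity bullet above typically reduces to recomputing a checksum over the restored artifact and comparing it with the value recorded at backup time. A minimal sketch, assuming SHA-256 checksums stored alongside each backup:

```python
import hashlib


def file_checksum(path: str, algo: str = "sha256") -> str:
    """Stream the file in 1 MiB chunks so large backup archives
    never need to fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def backup_is_intact(path: str, recorded_checksum: str) -> bool:
    """Compare the restored file against the checksum captured at backup time."""
    return file_checksum(path) == recorded_checksum
```

Running this against restores in an isolated environment, rather than against the backup media in place, also exercises the restore path itself, which is the part that fails most often in practice.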

Module 6: Change and Configuration Management

  • Enforcing canary deployment patterns with automated rollback triggers based on availability metrics.
  • Version-controlling infrastructure-as-code (IaC) templates and validating drift detection mechanisms.
  • Implementing approval gates for changes during high-risk periods, such as peak traffic seasons.
  • Using feature flags with kill switches to disable components without redeploying binaries.
  • Scanning configuration files for hardcoded secrets before merging into production pipelines.
  • Requiring peer review for changes to load balancer health check configurations due to their impact on traffic routing.
  • Archiving deprecated configuration variants to prevent accidental reuse in future deployments.
  • Validating configuration templates against schema rules before applying to multi-region environments.
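The canary-deployment bullet above hinges on an automated rollback trigger. Here is an illustrative gate comparing the canary's success rate against the stable baseline; the 0.5-percentage-point tolerance is an assumption for the example, not a recommended production value.

```python
def should_rollback(canary_success_rate: float,
                    baseline_success_rate: float,
                    tolerance: float = 0.005) -> bool:
    """True when the canary underperforms the baseline by more than
    `tolerance`. Both rates are fractions in [0, 1], e.g. 0.999 = 99.9%."""
    return (baseline_success_rate - canary_success_rate) > tolerance
```

Comparing against the live baseline rather than a fixed threshold keeps the trigger meaningful during ambient degradation, when both cohorts are affected equally.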

Module 7: Capacity Planning and Scalability

  • Projecting resource needs based on historical growth trends and upcoming product launches.
  • Setting up predictive autoscaling using ML-driven forecasting models for seasonal traffic patterns.
  • Reserving capacity in secondary regions to handle failover workloads without performance degradation.
  • Monitoring queue depth and backlog growth in message brokers to anticipate scaling bottlenecks.
  • Conducting load tests with production-like data distributions to validate scaling assumptions.
  • Right-sizing instance families based on memory-to-CPU ratios observed in profiling data.
  • Managing cold start risks in serverless environments by provisioning concurrency limits and pre-warming strategies.
  • Tracking dependency saturation points, such as database connection pools, during scaling events.
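The first bullet in this module, projecting resource needs from historical growth and upcoming launches, can be sketched as a simple linear extrapolation. The launch-uplift fraction and 30% headroom default are labeled assumptions; real forecasts would use the ML-driven models mentioned above.

```python
def project_capacity(monthly_peaks: list[float], months_ahead: int,
                     launch_uplift: float = 0.0,
                     headroom: float = 0.3) -> float:
    """Extrapolate average month-over-month growth from historical peaks,
    apply an expected launch uplift fraction, then pad with headroom
    for failover load and traffic bursts."""
    growth = (monthly_peaks[-1] - monthly_peaks[0]) / (len(monthly_peaks) - 1)
    projected = monthly_peaks[-1] + growth * months_ahead
    return projected * (1 + launch_uplift) * (1 + headroom)
```

The headroom term is what reserves capacity for failover workloads in secondary regions, per the third bullet above.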

Module 8: Governance and Compliance in Availability Management

  • Aligning availability controls with regulatory requirements (e.g., PCI-DSS, HIPAA) for data access during outages.
  • Documenting availability architecture decisions in system risk assessment reports for audit purposes.
  • Enforcing retention policies for incident logs and post-mortem records based on compliance mandates.
  • Classifying systems by business impact to prioritize availability investments and recovery order.
  • Implementing access controls for failover execution tools to meet segregation of duties requirements.
  • Reporting availability metrics to executive stakeholders using standardized templates that exclude sensitive details.
  • Updating business continuity plans to reflect changes in cloud provider dependencies and third-party services.
  • Conducting third-party assessments of vendor SLAs to validate claims about infrastructure redundancy.
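Classifying systems by business impact to set recovery order, per the fourth bullet above, can be sketched as a two-key sort: impact tier first, then recovery time objective. The tier names and example systems below are invented for illustration.

```python
# Hypothetical impact tiers, highest priority first.
TIER_ORDER = {"mission_critical": 0, "business_critical": 1, "supporting": 2}


def recovery_order(systems: list[dict]) -> list[str]:
    """Order systems so the highest-impact, tightest-RTO services
    are recovered first during DR activation."""
    ranked = sorted(systems,
                    key=lambda s: (TIER_ORDER[s["tier"]], s["rto_minutes"]))
    return [s["name"] for s in ranked]
```

The same classification drives where availability investment goes: the top tier justifies active-active redundancy, while supporting systems may only warrant backup-and-restore.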

Module 9: Post-Incident Analysis and System Hardening

  • Conducting blameless post-mortems with mandatory participation from all involved engineering teams.
  • Tracking action item completion from incident reviews using integrated project management tools.
  • Prioritizing remediation tasks based on recurrence likelihood and potential impact on availability.
  • Implementing automated tests that reproduce root cause conditions to prevent regression.
  • Updating monitoring dashboards to include signals that would have detected the incident earlier.
  • Introducing chaos engineering experiments targeting identified failure modes in staging environments.
  • Revising on-call playbooks with new diagnostic steps and escalation paths based on incident findings.
  • Measuring mean time to recovery (MTTR) improvements after implementing system hardening changes.
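Measuring MTTR improvement, per the last bullet, starts with computing MTTR over a set of incidents. A minimal sketch, with incidents represented as (start, resolved) timestamp pairs:

```python
from datetime import datetime


def mttr_minutes(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean time to recovery in minutes across (start, resolved) pairs."""
    total_seconds = sum((resolved - start).total_seconds()
                        for start, resolved in incidents)
    return total_seconds / len(incidents) / 60.0
```

Comparing this figure over comparable windows before and after a hardening change (and alongside incident counts, since fewer but longer incidents can mask regressions) is what turns post-incident action items into a measurable outcome.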