Description

This curriculum spans the technical and procedural rigor of a multi-workshop infrastructure modernization program, covering the same depth of operational discipline applied in enterprise-scale cloud migrations and internal platform engineering initiatives.

Module 1: Infrastructure Planning and Environment Segregation

Selecting appropriate server instance types based on application load profiles and cost-performance trade-offs across development, staging, and production environments.
Designing network segmentation to isolate environments while enabling controlled data flow for testing and monitoring.
Implementing consistent naming conventions and tagging strategies for cloud resources to support billing, automation, and audit compliance.
Deciding between containerized versus bare-metal deployments for latency-sensitive or compliance-bound applications.
Establishing quotas and limits on non-production environments to prevent resource sprawl and cost overruns.
Integrating environment provisioning into CI/CD pipelines while maintaining manual approval gates for production deployment.
Documenting hardware and software dependencies for legacy integrations that constrain infrastructure choices.
Planning for regional failover by replicating critical environments in geographically distributed zones.

Module 2: Configuration Management and Infrastructure as Code

Choosing between declarative (e.g., Terraform) and imperative (e.g., Ansible) tools based on team expertise and rollback requirements.
Versioning configuration templates and aligning them with application release cycles to prevent drift.
Managing secrets in IaC pipelines using dedicated vault integrations instead of environment variables or config files.
Enforcing peer review of infrastructure changes through pull request workflows in version control.
Handling state file management for Terraform in team environments using remote backends with locking mechanisms.
Automating drift detection and reconciliation for production servers that may have been modified outside of IaC.
Standardizing OS images using Packer to reduce configuration variance across server fleets.
Defining reusable modules for common components like load balancers or database proxies to ensure consistency.

Module 3: Patch Management and System Updates

Scheduling maintenance windows for patching that minimize disruption to user-facing services and third-party integrations.
Testing security patches in a mirrored staging environment before rollout to production.
Implementing automated patch compliance reporting to meet internal audit and regulatory requirements.
Choosing between rolling updates and blue-green replacement for minimizing downtime during kernel or library upgrades.
Managing dependencies when updating system libraries that may break application compatibility.
Configuring unattended-upgrades selectively to apply only security patches on critical systems.
Documenting rollback procedures for failed updates, including snapshot restoration and configuration reversion.
Coordinating with development teams to avoid patching during active deployment cycles.

Module 4: Monitoring, Logging, and Alerting

Defining baseline performance metrics (CPU, memory, disk I/O) for each service to detect anomalies.
Configuring log retention policies that balance storage costs with forensic investigation needs.
Filtering and parsing application logs to extract structured error events for alerting systems.
Setting threshold-based alerts with hysteresis to reduce false positives from transient spikes.
Integrating APM tools with custom instrumentation to trace latency across microservices.
Centralizing logs using secure transport (e.g., TLS) from distributed servers to a SIEM platform.
Creating service-level dashboards that reflect business KPIs rather than just technical metrics.
Validating alert delivery paths (SMS, email, PagerDuty) through periodic test incidents.

Module 5: Backup, Recovery, and Disaster Planning

Classifying data by recovery point and recovery time objectives to determine backup frequency and method.
Encrypting backup data at rest and in transit, with access restricted to authorized personnel only.
Testing full-system recovery from backups in an isolated environment at least quarterly.
Storing backups in geographically separate regions to mitigate data center outages.
Automating backup validation by checksum verification and file integrity checks.
Documenting recovery runbooks with step-by-step procedures for different failure scenarios.
Coordinating backup schedules to avoid resource contention during peak application usage.
Managing retention periods for backups in accordance with data privacy regulations (e.g., GDPR, HIPAA).

Module 6: Security Hardening and Access Control

Disabling root SSH access and enforcing key-based authentication with multi-factor for administrative access.
Implementing role-based access control (RBAC) for server management tools based on job function.
Applying the principle of least privilege when granting sudo permissions to developers and operators.
Configuring host-based firewalls to restrict inbound and outbound traffic to required ports only.
Rotating SSH keys and service account credentials on a scheduled basis or after personnel changes.
Conducting regular vulnerability scans and remediating findings based on severity and exploitability.
Enabling audit logging for privileged commands and monitoring for anomalous behavior.
Hardening OS configurations using CIS benchmarks or internal security baselines.

Module 7: Performance Tuning and Resource Optimization

Profiling application bottlenecks using system tools (e.g., top, iostat, netstat) to guide infrastructure adjustments.
Adjusting kernel parameters (e.g., file descriptor limits, TCP buffer sizes) for high-throughput services.
Right-sizing virtual machines based on actual utilization trends rather than peak theoretical loads.
Implementing caching layers (e.g., Redis, Varnish) to reduce backend server load and response latency.
Monitoring swap usage to detect memory pressure and prevent performance degradation.
Optimizing database connection pooling to prevent exhaustion of server resources.
Using autoscaling policies based on custom metrics rather than generic CPU thresholds.
Identifying and terminating orphaned processes or zombie containers consuming system resources.

Module 8: Change Management and Operational Governance

Requiring change tickets for all production modifications, including emergency fixes with post-incident review.
Conducting blameless postmortems for outages to identify systemic issues and prevent recurrence.
Maintaining an up-to-date configuration management database (CMDB) to track server ownership and purpose.
Enforcing change freeze periods during critical business cycles (e.g., end-of-quarter, Black Friday).
Requiring dual approval for high-risk operations such as database migrations or firewall rule changes.
Archiving decommissioned servers only after confirming all dependencies have been removed.
Standardizing incident response workflows with escalation paths and communication protocols.
Conducting regular operational reviews to assess tooling effectiveness and process adherence.

Module 9: Integration with Development Workflows

Providing developers with self-service tools to spin up isolated test environments with production-like configurations.
Enforcing pre-deployment checks in CI pipelines for configuration compliance and security scanning.
Instrumenting servers to expose health endpoints consumed by Kubernetes liveness and readiness probes.
Collaborating with developers to define resource requests and limits in container manifests.
Sharing performance and error metrics with development teams to inform optimization efforts.
Documenting server dependencies and API contracts to prevent breaking changes during maintenance.
Automating the cleanup of ephemeral environments to prevent resource leakage.
Establishing feedback loops between operations and development to refine observability requirements.