This curriculum covers the technical and procedural scope of a multi-phase network operations readiness program, comparable to an internal capability build for global infrastructure teams that manage hybrid environments at scale.
Module 1: Network Infrastructure Assessment and Discovery
- Conduct automated network device discovery using SNMP and CLI-based polling to identify all managed Layer 2 and Layer 3 assets, and infer unmanaged switches (which cannot be polled directly) from access ports that learn multiple MAC addresses.
- Evaluate the accuracy of network topology maps by reconciling data from NetFlow, CDP, and LLDP against configuration management databases (CMDB).
- Identify shadow IT devices by analyzing DHCP logs and MAC address vendor prefixes across access layer switches.
- Assess firmware versions across heterogeneous vendor environments (Cisco, Juniper, Arista) to determine patch compliance and end-of-support risks.
- Map application dependency flows using packet capture and flow analysis to uncover undocumented interdependencies between business services.
- Document physical and logical network segmentation, including VLANs, VRFs, and firewall zones, to support compliance audit requirements.
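The shadow IT check above can be sketched in a few lines of Python. This is a minimal illustration, not a full discovery tool: the lease record layout (a dict with a `mac` key), the function names, and the approved prefix list are all assumptions made for the example, and the prefixes shown are placeholders rather than real vendor assignments.

```python
import re


def oui(mac: str) -> str:
    """Return the first three octets of a MAC address as 'AA:BB:CC'.

    Accepts common notations such as 00:11:22:33:44:55, 00-11-22-33-44-55,
    and the dotted form aabb.ccdd.eeff.
    """
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).upper()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in (0, 2, 4))


def find_unapproved(leases, approved_ouis):
    """Return the leases whose vendor prefix is not on the approved list.

    `leases` is a hypothetical list of dicts parsed from DHCP logs;
    `approved_ouis` is a hypothetical allow-list of vendor prefixes.
    """
    approved = {prefix.upper() for prefix in approved_ouis}
    return [lease for lease in leases if oui(lease["mac"]) not in approved]


if __name__ == "__main__":
    sample_leases = [
        {"mac": "00:11:22:33:44:55", "ip": "192.0.2.10"},
        {"mac": "aabb.ccdd.eeff", "ip": "192.0.2.11"},
    ]
    for lease in find_unapproved(sample_leases, {"00:11:22"}):
        print("review:", lease)
```

A prefix match is only a first-pass filter: devices that use randomized or spoofed MAC addresses need corroboration from other sources such as 802.1X or switch port data.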
Module 2: Network Monitoring and Performance Management
- Configure threshold-based alerts for interface utilization, error rates, and latency using time-series databases like InfluxDB or Prometheus.
- Implement synthetic transaction monitoring for critical applications (e.g., VoIP, ERP) to simulate user experience and detect degradation before user impact.
- Integrate network performance data with APM tools to correlate infrastructure metrics with application response times.
- Deploy distributed packet capture points at key network aggregation layers to support forensic analysis without overloading central systems.
- Adjust polling intervals for SNMP devices based on device type and criticality to balance monitoring granularity with resource utilization.
- Validate the stability of BGP and OSPF neighbor adjacencies through continuous reachability checks and event-triggered diagnostics.
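As an illustration of the threshold-based alerting topic, the sketch below derives interface utilization from two octet counter samples (for example, successive SNMP polls of an interface octet counter) and maps the result to an alert level. The function names and the 70/90 percent thresholds are assumptions chosen for the example, not recommended values.

```python
def utilization_pct(prev_octets, curr_octets, interval_s, speed_bps, counter_bits=64):
    """Percent utilization between two octet counter samples.

    The modulo handles a single counter wrap between polls. A device reboot
    also resets counters and would be misread as a wrap, so a real collector
    should additionally check the device uptime.
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    return 100 * delta_octets * 8 / (interval_s * speed_bps)


def classify(util_pct, warn=70.0, crit=90.0):
    """Map a utilization percentage to an alert level (thresholds are examples)."""
    if util_pct >= crit:
        return "critical"
    if util_pct >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # 37.5 MB transferred in 60 s on a 10 Mb/s link.
    util = utilization_pct(0, 37_500_000, 60, 10_000_000)
    print(f"{util:.1f}% -> {classify(util)}")
```

Shorter polling intervals make wraps on 32-bit counters less likely, which is one reason polling frequency is tuned per device type and link speed.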
Module 3: Configuration Management and Change Control
- Enforce configuration drift detection by scheduling automated diffs between running and baseline configurations using tools like RANCID or Oxidized.
- Implement role-based access control (RBAC) for configuration changes, ensuring separation between network operators and approvers.
- Integrate configuration templates with version control systems (e.g., Git) to track changes, support rollbacks, and audit modifications.
- Standardize interface descriptions and VLAN naming conventions across all switches to ensure operational consistency.
- Automate pre-change validation checks, such as verifying available ACL space before firewall rule insertion.
- Enforce change freeze windows during critical business periods and coordinate with change advisory boards (CAB).
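Tools such as RANCID or Oxidized handle collection and storage of configurations; the comparison step itself can be sketched with Python's standard `difflib` module. The function name and the sample configurations below are illustrative assumptions, and a production check would usually also filter volatile lines such as timestamps.

```python
import difflib


def config_drift(baseline: str, running: str):
    """Return only the lines that differ between baseline and running config.

    Lines prefixed with '- ' exist only in the baseline; lines prefixed with
    '+ ' exist only in the running configuration.
    """
    diff = difflib.ndiff(baseline.splitlines(), running.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    baseline_cfg = "hostname sw1\nntp server 192.0.2.1\n"
    running_cfg = "hostname sw1\nntp server 192.0.2.1\nsnmp-server community public RO\n"
    for line in config_drift(baseline_cfg, running_cfg):
        print(line)
```

An empty result means the running configuration matches the baseline; any non-empty result can be attached to a change ticket or raised as a drift alert.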
Module 4: Fault Management and Incident Response
- Normalize and deduplicate syslog messages from multi-vendor devices using structured parsing and severity remapping.
- Route high-severity network alerts to on-call engineers via escalation policies in incident management platforms like PagerDuty.
- Develop runbooks for common failure scenarios, such as BGP session flaps or STP topology changes, with predefined diagnostic steps.
- Correlate device-level faults with environmental data (e.g., power, temperature) from DCIM systems to identify root causes.
- Implement automated suppression of alerts during planned maintenance to reduce noise in monitoring systems.
- Conduct post-incident reviews to update monitoring thresholds and detection logic based on actual failure patterns.
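Two of the steps above, severity extraction and deduplication, can be sketched as follows. The severity is taken from the syslog PRI field, where the value in angle brackets encodes facility times 8 plus severity. The event tuple layout, the function names, and the 60-second window are assumptions for illustration.

```python
import re

PRI_RE = re.compile(r"^<(\d{1,3})>")


def severity(message: str):
    """Return the syslog severity (0 = emergency ... 7 = debug), or None.

    The PRI value at the start of a syslog message is facility * 8 + severity.
    """
    match = PRI_RE.match(message)
    if not match:
        return None
    return int(match.group(1)) % 8


def deduplicate(events, window_s=60):
    """Drop repeats of the same (host, text) seen within `window_s` seconds.

    `events` is a hypothetical iterable of (timestamp, host, text) tuples,
    sorted by timestamp. The window is measured from the last kept event.
    """
    last_kept = {}
    kept = []
    for ts, host, text in events:
        key = (host, text)
        if key in last_kept and ts - last_kept[key] < window_s:
            continue
        last_kept[key] = ts
        kept.append((ts, host, text))
    return kept


if __name__ == "__main__":
    print(severity("<187>sw1: link down on uplink"))  # facility 23, severity 3
    flaps = [(0, "sw1", "link down"), (10, "sw1", "link down"), (70, "sw1", "link down")]
    print(deduplicate(flaps))
```

Vendor-specific severity remapping would sit between these two steps, translating each vendor's levels onto one common scale before alerts are routed.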
Module 5: Capacity Planning and Bandwidth Optimization
- Forecast bandwidth demand for WAN links by analyzing 95th percentile utilization trends over six-month intervals.
- Identify underutilized circuits for potential decommissioning or repurposing based on sustained low usage metrics.
- Model the impact of new applications (e.g., video conferencing) on LAN and WAN capacity using traffic simulation tools.
- Implement QoS policies to prioritize real-time traffic and enforce rate limits on non-critical applications.
- Optimize MPLS circuit utilization by redistributing traffic across multiple paths using TE tunnels or SD-WAN.
- Validate capacity models against actual traffic growth to refine forecasting algorithms and assumptions.
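The 95th percentile figure used for WAN forecasting can be computed in several ways. The sketch below uses the nearest-rank method (sort the samples, then take the value at rank ceil(0.95 × n)), which is an assumption here; some monitoring tools interpolate between samples instead and will report slightly different values.

```python
def percentile_nearest_rank(samples, pct=95):
    """Return the `pct`-th percentile of `samples` using the nearest-rank method.

    `pct` is an integer from 1 to 100. Integer arithmetic is used for the
    rank so that floating-point rounding cannot shift the selected sample.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical utilization samples in Mb/s: steady load with one short burst.
    samples_mbps = [10] * 19 + [1000]
    print(percentile_nearest_rank(samples_mbps))  # the single burst is discarded
```

Because the top 5 percent of samples are ignored, brief bursts do not inflate the figure, which is why the metric suits trend analysis better than peak utilization.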
Module 6: Security Integration and Network Access Control
- Enforce 802.1X authentication on access switches and integrate with RADIUS servers for dynamic VLAN assignment.
- Automatically quarantine devices with outdated OS patches or missing antivirus protection by integrating NAC with endpoint management systems.
- Deploy micro-segmentation using group-based policies (e.g., Cisco SD-Access) to limit lateral movement.
- Sync firewall rule change requests with network change management systems to ensure audit trail consistency.
- Monitor for unauthorized DHCP servers by enabling DHCP snooping and logging violations on access switches.
- Integrate threat intelligence feeds with perimeter firewalls to dynamically block known malicious IP addresses.
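Threat feed integration is normally handled by the firewall platform itself, but the matching logic can be illustrated with Python's standard `ipaddress` module. The feed format assumed here (one address or CIDR prefix per line, with `#` comments) and the function names are hypothetical; the addresses come from ranges reserved for documentation.

```python
import ipaddress


def load_blocklist(entries):
    """Parse feed lines into network objects, skipping comments and bad lines."""
    networks = []
    for entry in entries:
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue
        try:
            # strict=False accepts entries with host bits set, e.g. 203.0.113.5/24.
            networks.append(ipaddress.ip_network(entry, strict=False))
        except ValueError:
            continue  # malformed entry: ignore rather than fail the whole feed
    return networks


def is_blocked(ip: str, networks) -> bool:
    """Return True if `ip` falls inside any network on the blocklist."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)


if __name__ == "__main__":
    feed = ["# example feed", "203.0.113.0/24", "198.51.100.7"]
    blocklist = load_blocklist(feed)
    print(is_blocked("203.0.113.42", blocklist))
    print(is_blocked("198.51.100.8", blocklist))
```

Skipping malformed lines keeps one bad feed entry from blocking an update, though a real integration should log and count the rejects so that feed quality problems stay visible.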
Module 7: Automation and Orchestration Frameworks
- Develop Ansible playbooks to automate firmware upgrades across switch stacks with built-in health checks and rollback procedures.
- Use Python scripts with Netmiko or NAPALM to extract interface statistics and generate custom reports not supported by commercial tools.
- Orchestrate VLAN provisioning across switches, firewalls, and DHCP servers using workflow engines like StackStorm.
- Implement REST API integrations between network controllers and cloud provisioning platforms (e.g., VMware, OpenStack).
- Validate automation scripts in a lab environment that mirrors production topology and device versions.
- Log all automated actions in a centralized audit repository with timestamps, user context, and change details.
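The pattern behind upgrades with health checks, rollback, and audit logging can be sketched independently of any particular tool. In the example below the `apply`, `health_check`, and `rollback` callables are hypothetical stand-ins for whatever an Ansible playbook or a Netmiko/NAPALM script would actually run, and the audit record fields are one possible layout, not a standard.

```python
import json
from datetime import datetime, timezone


def audit_entry(user, device, action, outcome, when=None):
    """Build one JSON audit record with timestamp, user context, and details."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": when.isoformat(),
            "user": user,
            "device": device,
            "action": action,
            "outcome": outcome,
        },
        sort_keys=True,
    )


def change_with_rollback(device, user, apply, health_check, rollback, log):
    """Apply a change, verify it, and roll back if the health check fails.

    Every outcome is appended to `log`, a list standing in for a central
    audit repository. Returns True if the change was kept.
    """
    apply(device)
    if health_check(device):
        log.append(audit_entry(user, device, "upgrade", "success"))
        return True
    rollback(device)
    log.append(audit_entry(user, device, "upgrade", "rolled_back"))
    return False


if __name__ == "__main__":
    versions = {"sw1": "1.0"}  # simulated device state
    audit_log = []
    kept = change_with_rollback(
        "sw1",
        "automation-svc",
        apply=lambda d: versions.update({d: "2.0"}),
        health_check=lambda d: False,  # simulate a failed post-check
        rollback=lambda d: versions.update({d: "1.0"}),
        log=audit_log,
    )
    print(kept, versions, audit_log[0])
```

Passing the three steps in as callables lets the same control flow be exercised in a lab against simulated devices before it is pointed at production equipment.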
Module 8: Governance, Compliance, and Reporting
- Generate monthly compliance reports for SOX or HIPAA that detail configuration change history and access logs for network devices.
- Enforce encryption of management traffic (SSH, HTTPS) and disable legacy cleartext protocols (Telnet, SNMPv1/v2c) across all devices.
- Conduct quarterly access reviews to remove obsolete user accounts and revoke unneeded elevated privileges on network systems.
- Archive network configuration backups for seven years to meet regulatory retention requirements.
- Validate backup integrity by performing periodic restore tests on critical routers and firewalls.
- Report network availability SLAs to stakeholders using uptime data derived from monitoring systems and incident records.
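For the SLA reporting item, availability is the share of the reporting period not covered by outages. The sketch below merges overlapping outage records before summing, so that two incidents logged for the same event are not counted twice. The function name and the (start, end) tuple layout in seconds are assumptions for the example.

```python
def availability_pct(period_s, outages):
    """Return availability as a percentage of a reporting period.

    `outages` is a hypothetical list of (start, end) offsets in seconds from
    the start of the period. Overlapping outages are merged, and any part of
    an outage outside the period is ignored.
    """
    if period_s <= 0:
        raise ValueError("period must be positive")
    downtime = 0
    cur_start = cur_end = None
    for start, end in sorted(outages):
        start, end = max(start, 0), min(end, period_s)
        if end <= start:
            continue
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                downtime += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        downtime += cur_end - cur_start
    return 100 * (period_s - downtime) / period_s


if __name__ == "__main__":
    thirty_days = 30 * 24 * 3600
    incidents = [(1000, 2800), (2000, 3592)]  # two overlapping incident records
    print(f"{availability_pct(thirty_days, incidents):.3f}%")
```

Whether planned maintenance counts as downtime depends on the SLA definition, so maintenance windows should be filtered out of (or kept in) the outage list before this calculation, as the agreement specifies.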