This curriculum covers the technical, operational, and cross-functional coordination challenges of establishing and maintaining service level management practices across large-scale, multi-team environments. Its scope is comparable to a multi-workshop program addressing SRE governance, incident response integration, vendor oversight, and regulatory alignment.
Module 1: Defining and Aligning Service Level Objectives (SLOs)
- Selecting appropriate SLOs based on business-critical transaction paths rather than infrastructure metrics alone
- Negotiating SLO thresholds with business units when historical performance data shows current targets are unattainable
- Deciding whether to use rolling time windows (e.g., 28-day) or calendar-based (e.g., monthly) SLO evaluation periods
- Handling conflicting SLO requirements between internal IT operations and external customer expectations
- Documenting SLO ownership and accountability across service delivery teams in multi-vendor environments
- Adjusting SLOs during planned system migrations or major version upgrades with temporary performance impacts
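The choice between rolling and calendar-based evaluation periods can be illustrated with a minimal Python sketch. The function names, the per-day `(good, total)` data layout, and the sample figures are assumptions made for illustration, not part of any particular monitoring product.

```python
from datetime import date, timedelta


def compliance(daily, start, end):
    """Return the good/total ratio for all days in [start, end], inclusive."""
    good = total = 0
    for day, (g, t) in daily.items():
        if start <= day <= end:
            good += g
            total += t
    return good / total if total else None


def rolling_window(daily, as_of, days=28):
    """Compliance over the `days` days ending on `as_of` (inclusive)."""
    return compliance(daily, as_of - timedelta(days=days - 1), as_of)


def calendar_month(daily, year, month):
    """Compliance over one calendar month."""
    start = date(year, month, 1)
    # First day of the following month, rolling over the year in December.
    next_month = date(year + (month == 12), month % 12 + 1, 1)
    return compliance(daily, start, next_month - timedelta(days=1))


# Illustrative data: 60 days from 2024-01-01, 1000 requests per day,
# all successful except 100 failures on 2024-01-31.
daily = {date(2024, 1, 1) + timedelta(days=i): (1000, 1000) for i in range(60)}
daily[date(2024, 1, 31)] = (900, 1000)

print(calendar_month(daily, 2024, 2))
print(rolling_window(daily, date(2024, 2, 15)))
```

In this sample the bad day on January 31 drops out of the February calendar report entirely, while the 28-day rolling window evaluated on February 15 still counts it. That difference is one of the trade-offs to weigh when picking an evaluation period.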
Module 2: Designing and Instrumenting Service Level Indicators (SLIs)
- Choosing between synthetic monitoring and real user monitoring (RUM) for measuring availability SLIs
- Implementing consistent SLI calculation logic across microservices with different technology stacks
- Configuring logging and telemetry pipelines to capture sufficient data for SLI computation without violating privacy policies
- Defining SLI failure criteria that count both technical failures and business logic exceptions against the error budget
- Calibrating latency SLIs using percentiles (e.g., p95) while justifying the choice to stakeholders
- Validating SLI accuracy by cross-referencing monitoring tools during incident postmortems
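As a rough illustration of percentile-based latency SLIs, the sketch below computes a nearest-rank percentile and a threshold-based "good request" ratio. The nearest-rank method is one of several percentile definitions, and the sample latencies and the 500 ms threshold are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests served within the latency threshold."""
    good = sum(1 for x in latencies_ms if x <= threshold_ms)
    return good / len(latencies_ms)


# Hypothetical sample: 90 fast requests, 8 slow ones, 2 very slow outliers.
latencies = [120] * 90 + [450] * 8 + [2000] * 2

print(percentile(latencies, 50))
print(percentile(latencies, 95))
print(percentile(latencies, 99))
print(latency_sli(latencies, threshold_ms=500))
```

With this sample the median hides the slow tail completely, p95 surfaces the 450 ms group, and only p99 exposes the 2000 ms outliers. A comparison of this kind can help when justifying a percentile choice to stakeholders.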
Module 3: Establishing Error Budget Policies and Governance
- Setting error budget consumption thresholds that trigger mandatory change freezes or architecture reviews
- Deciding whether to pool error budgets across related services or maintain service-specific budgets
- Handling exceptions when product teams exceed error budgets due to externally mandated compliance changes
- Designing escalation paths when SRE and development teams disagree on error budget accountability
- Adjusting error budget calculations during seasonal traffic spikes or marketing campaigns
- Archiving and auditing error budget usage for regulatory or internal audit requirements
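A consumption-threshold policy of the kind described above might be sketched as follows. The 75% and 100% thresholds, the action names, and the function names are hypothetical and would be set by each organization's governance process.

```python
def budget_consumed(slo_target, total_events, bad_events):
    """Fraction of the error budget used, where the budget is (1 - SLO) * total events."""
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target leaves no error budget")
    return bad_events / allowed_bad


# Hypothetical policy, ordered from most to least severe.
POLICY = [
    (1.00, "change freeze"),
    (0.75, "architecture review"),
]


def policy_action(consumed):
    """Return the first action whose threshold the consumption meets or exceeds."""
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return "normal operations"


# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
for bad in (100, 800, 1200):
    consumed = budget_consumed(0.999, 1_000_000, bad)
    print(bad, round(consumed, 2), policy_action(consumed))
```

Keeping the policy as an ordered table rather than hard-coded branches makes the thresholds easy to review and audit alongside the budget figures themselves.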
Module 4: Integrating SLAs with Incident Management
- Configuring incident severity levels to align with SLA breach timelines and escalation procedures
- Automating SLA breach notifications to legal and customer success teams when thresholds are crossed
- Documenting incident timelines to determine whether SLA credits apply under contract terms
- Coordinating incident response activities with SLA clock management during extended outages
- Reconciling internal SLOs with externally reported SLA metrics when measurement methodologies differ
- Updating runbooks to include SLA-specific communication templates for customer-facing teams
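SLA clock management can be sketched as wall-clock time minus any intervals during which the contract allows the clock to be paused. Whether pauses are permitted at all depends on the contract. The severity targets, the pause mechanism, and the assumption that pause intervals do not overlap are all illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per severity level.
RESOLUTION_TARGETS = {
    "sev1": timedelta(hours=4),
    "sev2": timedelta(hours=8),
}


def sla_elapsed(opened, now, pauses=()):
    """Time counted against the SLA clock: wall time minus paused intervals.

    Pauses are (start, end) pairs, clipped to [opened, now] and assumed
    not to overlap each other.
    """
    elapsed = now - opened
    for start, end in pauses:
        clipped_start = max(start, opened)
        clipped_end = min(end, now)
        if clipped_end > clipped_start:
            elapsed -= clipped_end - clipped_start
    return elapsed


def is_breached(severity, opened, now, pauses=()):
    """True once the counted time exceeds the target for this severity."""
    return sla_elapsed(opened, now, pauses) > RESOLUTION_TARGETS[severity]


opened = datetime(2024, 3, 1, 9, 0)
now = datetime(2024, 3, 1, 14, 0)
# Clock paused for 90 minutes, e.g. while waiting on customer input.
pauses = [(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 11, 30))]

print(sla_elapsed(opened, now, pauses))
print(is_breached("sev1", opened, now, pauses))
print(is_breached("sev1", opened, now))
```

Recording each pause with explicit timestamps also produces the documented incident timeline needed later to decide whether SLA credits apply.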
Module 5: Reporting and Dashboarding for Stakeholder Transparency
- Designing executive dashboards that show SLO compliance without exposing sensitive operational details
- Scheduling automated SLA performance reports for legal, finance, and customer support departments
- Handling discrepancies between real-time dashboards and finalized monthly SLA reports due to data latency
- Implementing role-based access controls on SLO dashboards in shared monitoring platforms
- Selecting visualization formats that accurately represent error budget consumption trends over time
- Archiving historical SLO reports to support contract renewals and vendor performance evaluations
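One common way to present error budget consumption over time is a burn-down series, which a dashboard can plot directly. The sketch below assumes the budget is expressed as a count of allowed bad events for the period. The function name and sample figures are hypothetical.

```python
from itertools import accumulate


def budget_burn_down(daily_bad, budget_events):
    """Fraction of the error budget remaining at the end of each day.

    Values below zero mean the budget has been overspent.
    """
    return [1 - used / budget_events for used in accumulate(daily_bad)]


# Hypothetical week: a budget of 100 bad events, with a spike on days 3 and 4.
series = budget_burn_down([10, 0, 40, 50, 25], budget_events=100)
for day, remaining in enumerate(series, start=1):
    print(f"day {day}: {remaining:+.0%} of budget remaining")
```

Plotting the remaining fraction rather than raw daily error counts keeps overspend visible as a line crossing zero, which tends to be easier for non-operational stakeholders to read.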
Module 6: Managing Third-Party and Vendor SLAs
- Mapping internal SLOs to upstream vendor SLAs to identify coverage gaps and single points of failure
- Negotiating penalty clauses and credit terms in vendor contracts based on measurable SLA breaches
- Implementing independent monitoring to validate vendor-reported uptime and performance claims
- Coordinating incident investigations with third-party providers when root cause spans organizational boundaries
- Updating internal risk assessments when a critical vendor consistently operates near SLA thresholds
- Documenting fallback procedures when vendor SLAs do not support required business continuity objectives
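Mapping internal SLOs to upstream vendor SLAs often starts with a composite availability estimate. The sketch below assumes every vendor sits in the critical path and that failures are independent, which is a simplification. The vendor names and SLA figures are hypothetical.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain in which every dependency must be up (independence assumed)."""
    return math.prod(availabilities)


def monthly_downtime_minutes(availability, days=30):
    """Downtime allowed per month at the given availability."""
    return (1 - availability) * days * 24 * 60


# Hypothetical vendor SLAs for dependencies in the critical path.
vendor_slas = {"dns": 0.9999, "cloud": 0.9995, "payments": 0.999}
internal_slo = 0.999

composite = serial_availability(vendor_slas.values())
coverage_gap = composite < internal_slo

print(f"composite vendor availability: {composite:.5f}")
print(f"allowed downtime at internal SLO: {monthly_downtime_minutes(internal_slo):.1f} min/month")
print(f"coverage gap: {coverage_gap}")
```

Here the vendors' combined commitment falls below the internal 99.9% objective even though each vendor individually meets or exceeds it. This is the kind of coverage gap such a mapping is meant to reveal.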
Module 7: Continuous Improvement and SLO Maturity Assessment
- Conducting SLO health reviews to retire outdated objectives that no longer reflect business priorities
- Identifying services lacking SLOs due to legacy status or undocumented ownership
- Implementing feedback loops from customer support tickets to refine SLI definitions
- Assessing team SLO compliance rates to inform capacity planning and staffing decisions
- Standardizing SLO templates and review cycles across business units to reduce governance overhead
- Using SLO performance trends to prioritize technical debt reduction and platform modernization efforts
Module 8: Legal, Financial, and Regulatory Implications of SLA Management
- Coordinating with legal teams to ensure SLA terms are enforceable and align with jurisdictional requirements
- Calculating and processing SLA credits consistently across multiple customer contracts and billing systems
- Preparing documentation for auditors demonstrating compliance with industry-specific uptime requirements
- Managing disclosure of SLA breach data in public financial reports or regulatory filings
- Updating insurance policies to reflect SLA-related financial exposure from service credits
- Reconciling internal SLO practices with contractual SLAs during merger and acquisition due diligence
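Consistent credit calculation is easier when the credit schedule is expressed as data and monetary values use decimal arithmetic. The tiers, percentages, and function names below are hypothetical. Real schedules come from each customer contract.

```python
from decimal import Decimal

# Hypothetical schedule: (minimum availability %, credit as % of monthly fee),
# ordered from highest availability to lowest.
CREDIT_TIERS = [
    (Decimal("99.9"), Decimal("0")),
    (Decimal("99.0"), Decimal("10")),
    (Decimal("95.0"), Decimal("25")),
]
FLOOR_CREDIT = Decimal("50")  # applies below the lowest tier


def availability_pct(total_minutes, downtime_minutes):
    """Measured availability as a percentage of the billing period."""
    return Decimal(total_minutes - downtime_minutes) * 100 / Decimal(total_minutes)


def credit_pct(availability):
    """Look up the credit percentage for a measured availability."""
    for minimum, credit in CREDIT_TIERS:
        if availability >= minimum:
            return credit
    return FLOOR_CREDIT


def sla_credit(monthly_fee, total_minutes, downtime_minutes):
    """Credit amount rounded to cents; monthly_fee is given as a string or Decimal."""
    pct = credit_pct(availability_pct(total_minutes, downtime_minutes))
    return (Decimal(monthly_fee) * pct / 100).quantize(Decimal("0.01"))


MINUTES_IN_30_DAYS = 30 * 24 * 60

for downtime in (30, 120, 3000):
    print(downtime, sla_credit("2000.00", MINUTES_IN_30_DAYS, downtime))
```

Using `Decimal` instead of floating point avoids rounding drift when the same calculation is repeated across many contracts and reconciled against billing systems.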