Problem Management in Continual Service Improvement

$249.00
How you learn:
Self-paced • Lifetime updates
When you get access:
Course access is prepared after purchase and delivered via email
Toolkit Included:
Includes a practical, ready-to-use toolkit containing implementation templates, worksheets, checklists, and decision-support materials used to accelerate real-world application and reduce setup time.
Who trusts this:
Trusted by professionals in 160+ countries
Your guarantee:
30-day money-back guarantee — no questions asked

This curriculum spans the design and operational governance of a problem management function, comparable in scope to a multi-workshop internal capability program that aligns cross-functional teams on incident correlation, root cause analysis, and integration with change and continual service improvement processes.

Module 1: Defining Problem Management within the Service Lifecycle

  • Determine whether a known error database (KEDB) will be maintained separately from the configuration management system (CMS) or integrated, based on tooling constraints and change control workflows.
  • Establish decision criteria for escalating incidents to problem records, including frequency thresholds, business impact scoring, and service level agreement (SLA) breach risks.
  • Define ownership boundaries between incident and problem management teams when root cause analysis overlaps with immediate service restoration efforts.
  • Align problem prioritization with business service criticality by mapping configuration item (CI) dependencies to business processes in the service portfolio.
  • Negotiate integration points between problem management and change advisory board (CAB) processes to ensure permanent fixes undergo proper risk assessment.
  • Decide whether reactive problem management will trigger automatically from incident clustering rules or require manual initiation by service desk supervisors.
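
The escalation criteria in this module (frequency thresholds, business impact scoring, and SLA breach risk) can be sketched as a simple rule. This is a minimal illustration, not a prescribed method: the `IncidentCluster` fields, the 1-to-5 impact scale, and both threshold values are hypothetical and would be set by the service owner.

```python
from dataclasses import dataclass


@dataclass
class IncidentCluster:
    ci_id: str
    incident_count: int    # incidents on this CI within the review window
    impact_score: int      # hypothetical scale: 1 (low) to 5 (critical)
    sla_breach_risk: bool  # True if any incident is close to breaching its SLA


# Hypothetical thresholds chosen only for illustration.
FREQUENCY_THRESHOLD = 5
IMPACT_THRESHOLD = 4


def should_raise_problem(cluster: IncidentCluster) -> bool:
    """Return True if any one of the three escalation criteria is met."""
    if cluster.incident_count >= FREQUENCY_THRESHOLD:
        return True
    if cluster.impact_score >= IMPACT_THRESHOLD:
        return True
    return cluster.sla_breach_risk
```

A rule like this could either open the problem record automatically or flag the cluster for a service desk supervisor, which is the automatic-versus-manual decision named in the last item above.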

Module 2: Problem Identification and Categorization

  • Implement automated correlation rules in event management tools to detect incident spikes across CIs, and tune them to avoid false positives from scheduled maintenance.
  • Design a hierarchical categorization model that supports both technical domains (e.g., network, database) and business services to enable multidimensional reporting.
  • Select whether to use symptom-based or root cause-based categorization during initial logging, balancing speed of entry against long-term data accuracy.
  • Configure CI impact mapping in the CMS so that recurring incidents on shared infrastructure trigger proactive problem identification.
  • Integrate application performance monitoring (APM) data into problem intake to detect performance degradation patterns not captured in incident tickets.
  • Define thresholds for identifying major problems based on financial impact, customer count affected, or regulatory exposure, requiring input from business relationship managers.
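
As a rough sketch of a correlation rule that ignores scheduled maintenance, the function below counts incidents per CI and skips any that were opened inside a maintenance window. The data shapes, the function name `detect_spikes`, and the default threshold of 3 are assumptions for illustration; real event management tools expose their own rule syntax.

```python
from collections import Counter
from datetime import datetime


def detect_spikes(incidents, maintenance_windows, threshold=3):
    """Return CIs whose incident count reaches the threshold.

    incidents: list of (ci_id, opened_at) tuples.
    maintenance_windows: dict mapping ci_id to a list of (start, end) tuples.
    Incidents opened during a maintenance window for their CI are ignored.
    """
    counts = Counter()
    for ci_id, opened_at in incidents:
        windows = maintenance_windows.get(ci_id, [])
        if any(start <= opened_at <= end for start, end in windows):
            continue  # likely a false positive caused by planned work
        counts[ci_id] += 1
    return sorted(ci for ci, n in counts.items() if n >= threshold)
```

The threshold and the window definitions are the parts that usually need tuning once the rule runs against real incident data.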

Module 3: Root Cause Analysis Techniques and Application

  • Choose between Ishikawa diagrams and 5 Whys for RCA facilitation based on problem complexity and stakeholder technical literacy.
  • Conduct timeline analysis for outages by synchronizing logs from distributed systems, requiring coordination with infrastructure and security teams for access.
  • Decide whether to involve vendor support teams in RCA workshops, considering contractual obligations and data confidentiality agreements.
  • Apply Pareto analysis to historical incident data to focus RCA efforts on the small share of problems (by rule of thumb, roughly 20%) that cause most (roughly 80%) of service disruptions.
  • Document RCA findings in a standardized format that links evidence to conclusions, enabling auditability and knowledge reuse.
  • Balance depth of analysis against business urgency when determining how many contributing factors to investigate in a time-constrained environment.
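
The Pareto analysis mentioned above can be illustrated with a short sketch that ranks incident categories by frequency and keeps the top ones until a cumulative cutoff is reached. The function name, the category labels in the usage note, and the 0.8 default cutoff are illustrative assumptions.

```python
from collections import Counter


def pareto_categories(incident_categories, cutoff=0.8):
    """Return the most frequent categories that together reach the cutoff share.

    incident_categories: iterable with one category label per incident.
    """
    counts = Counter(incident_categories)
    total = sum(counts.values())
    selected = []
    cumulative = 0
    for category, n in counts.most_common():
        if cumulative / total >= cutoff:
            break  # the categories already selected cover the cutoff share
        selected.append(category)
        cumulative += n
    return selected
```

For example, with 50 "dns", 35 "disk", 10 "cert", and 5 "other" incidents, the function selects "dns" and "disk", which together account for 85% of the incidents.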

Module 4: Managing the Known Error Database (KEDB)

  • Define ownership for KEDB updates, specifying whether problem managers, technical leads, or configuration analysts are responsible for entry accuracy.
  • Establish synchronization protocols between the KEDB and knowledge management system to ensure workarounds are available to service desk personnel.
  • Implement version control for KEDB entries when multiple proposed fixes exist for the same known error, preventing conflicting guidance.
  • Enforce review cycles for KEDB entries tied to unresolved problems, triggering revalidation after 30, 60, and 90 days of inactivity.
  • Integrate KEDB search functionality into incident ticketing tools to prompt service desk agents with known workarounds during triage.
  • Restrict KEDB access based on role, ensuring that only authorized personnel can modify or retire entries linked to active changes.
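
The 30, 60, and 90 day revalidation cycle can be sketched as a date check. This is a minimal illustration under stated assumptions: the function name, the `completed` argument, and the choice to report only the highest outstanding milestone are hypothetical and not tied to any specific KEDB tool.

```python
from datetime import date

# Review milestones, in days of inactivity, taken from the cycle described above.
REVIEW_MILESTONES = (30, 60, 90)


def due_review_milestone(last_updated: date, today: date, completed=()):
    """Return the highest milestone reached but not yet completed, or None."""
    idle_days = (today - last_updated).days
    reached = [
        m for m in REVIEW_MILESTONES
        if idle_days >= m and m not in completed
    ]
    return max(reached) if reached else None
```

An entry last updated on 1 January 2024 and checked on 15 February 2024 has been idle for 45 days, so the 30-day review is due unless it has already been recorded as completed.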

Module 5: Problem Resolution and Change Implementation

  • Require problem records to reference associated change requests, ensuring traceability from diagnosis to deployment of fixes.
  • Assess whether a permanent fix requires standard, normal, or emergency change handling based on risk, impact, and urgency criteria.
  • Coordinate regression testing plans with application owners when deploying fixes that affect shared services or integrations.
  • Document rollback procedures for problem resolutions that introduce new code or configuration changes to production environments.
  • Verify that change success metrics (e.g., incident reduction, improvement in mean time to resolve (MTTR)) are captured post-implementation to validate resolution effectiveness.
  • Negotiate change freeze exceptions for critical problem resolutions during scheduled maintenance windows, requiring CAB escalation.
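
One way to picture the standard, normal, or emergency decision is as a small classification function. The scoring scale (1 to 5), the cutoffs, and the `preapproved` flag below are assumptions made for this sketch; actual criteria are defined by each organization's change policy.

```python
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"
    NORMAL = "normal"
    EMERGENCY = "emergency"


def classify_change(risk: int, impact: int, urgent: bool, preapproved: bool) -> ChangeType:
    """Pick a change type from hypothetical 1-to-5 risk and impact scores."""
    if urgent:
        # Urgent fixes follow the expedited emergency path.
        return ChangeType.EMERGENCY
    if preapproved and risk <= 2 and impact <= 2:
        # Low-risk, low-impact fixes with a pre-approved procedure.
        return ChangeType.STANDARD
    # Everything else goes through full assessment and CAB review.
    return ChangeType.NORMAL
```

Checking urgency first reflects one possible policy in which an urgent fix is never treated as a standard change, even when a pre-approved procedure exists.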

Module 6: Metrics, Reporting, and Performance Monitoring

  • Track mean time to resolve (MTTR) for problems segmented by priority, identifying bottlenecks in analysis or change approval stages.
  • Measure the percentage of incidents resolved using known errors, using the metric to assess KEDB utilization and knowledge transfer efficacy.
  • Report on problem backlog aging to identify stalled investigations requiring management intervention or resource reallocation.
  • Calculate cost of downtime per unresolved major problem to justify investment in root cause analysis resources.
  • Compare proactive versus reactive problem identification rates to evaluate the maturity of monitoring and trend analysis capabilities.
  • Align problem management KPIs with continual service improvement (CSI) goals, ensuring metrics drive actionable insights rather than compliance reporting.
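
Two of the metrics above, MTTR segmented by priority and the percentage of incidents resolved using known errors, can be computed as in the following sketch. The input shapes and the `resolved_via_known_error` field name are assumptions; a real report would read these values from the ticketing tool's export.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_priority(problems):
    """Return mean hours to resolve per priority.

    problems: iterable of (priority, hours_to_resolve) tuples.
    """
    grouped = defaultdict(list)
    for priority, hours in problems:
        grouped[priority].append(hours)
    return {priority: mean(hours) for priority, hours in grouped.items()}


def known_error_resolution_rate(incidents):
    """Return the percentage of incidents resolved using a known error.

    incidents: list of dicts with a boolean 'resolved_via_known_error' key.
    """
    if not incidents:
        return 0.0
    hits = sum(1 for incident in incidents if incident["resolved_via_known_error"])
    return 100.0 * hits / len(incidents)
```

Segmenting MTTR by priority keeps a few long-running low-priority problems from hiding delays in the high-priority ones.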

Module 7: Integration with Continual Service Improvement (CSI)

  • Feed problem trends into CSI registers to prioritize improvement initiatives based on recurring failure patterns.
  • Use problem data to validate baseline measurements during CSI phase two (Where are we now?) for service availability and reliability.
  • Link resolved problems to process improvements in change, release, or capacity management to close feedback loops.
  • Conduct quarterly service reviews using problem reports to demonstrate progress on reducing chronic incidents.
  • Update service design packages with lessons learned from major problem investigations to prevent recurrence in new implementations.
  • Embed problem-derived improvement actions into the seven-step improvement process, ensuring ownership and timeline accountability.

Module 8: Governance, Roles, and Cross-Functional Coordination

  • Define escalation paths for problems that exceed resolution time targets, specifying intervention points for service owners and technical directors.
  • Assign problem managers with cross-domain authority to coordinate RCA across siloed technical teams without direct reporting lines.
  • Establish service-level requirements for problem management in operational level agreements (OLAs) with support groups.
  • Conduct problem review meetings with application, infrastructure, and security stakeholders to validate findings and assign action items.
  • Enforce audit trails for problem record modifications to support compliance requirements in regulated environments.
  • Balance centralized oversight with decentralized execution by defining which decisions require problem management board approval versus team-level discretion.