Description

This curriculum covers the design and operational governance of a problem management function. Its scope is comparable to a multi-workshop internal capability program that aligns cross-functional teams on incident correlation, root cause analysis, and integration with change management and continual service improvement processes.

Module 1: Defining Problem Management within the Service Lifecycle

Determine whether the known error database (KEDB) will be maintained separately from the configuration management system (CMS) or integrated with it, based on tooling constraints and change control workflows.
Establish decision criteria for escalating incidents to problem records, including frequency thresholds, business impact scoring, and service level agreement (SLA) breach risks.
Define ownership boundaries between incident and problem management teams when root cause analysis overlaps with immediate service restoration efforts.
Align problem prioritization with business service criticality by mapping configuration item (CI) dependencies to business processes in the service portfolio.
Negotiate integration points between problem management and change advisory board (CAB) processes to ensure permanent fixes undergo proper risk assessment.
Decide whether reactive problem management will trigger automatically from incident clustering rules or require manual initiation by service desk supervisors.
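
The escalation criteria in this module can be expressed as a simple rule. The following is a minimal sketch, not a prescribed implementation: the IncidentCluster structure, its field names, and all three threshold values are hypothetical and would be replaced by the criteria an organization actually agrees on.

```python
from dataclasses import dataclass


@dataclass
class IncidentCluster:
    """Incidents grouped by a shared symptom or CI (hypothetical structure)."""
    incident_count: int      # incidents seen in the review window
    impact_score: int        # business impact, 1 (low) to 5 (critical)
    sla_breach_risk: float   # estimated probability of an SLA breach, 0.0 to 1.0


# Illustrative thresholds only; real values come from the agreed decision criteria.
FREQUENCY_THRESHOLD = 5
IMPACT_THRESHOLD = 4
SLA_RISK_THRESHOLD = 0.5


def should_raise_problem(cluster: IncidentCluster) -> bool:
    """Return True if any single criterion justifies opening a problem record."""
    return (
        cluster.incident_count >= FREQUENCY_THRESHOLD
        or cluster.impact_score >= IMPACT_THRESHOLD
        or cluster.sla_breach_risk >= SLA_RISK_THRESHOLD
    )
```

The same function could back either automatic triggering from clustering rules or a prompt shown to a service desk supervisor for manual initiation.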

Module 2: Problem Identification and Categorization

Implement automated correlation rules in event management tools to detect incident spikes across CIs, and tune those rules to avoid false positives from scheduled maintenance.
Design a hierarchical categorization model that supports both technical domains (e.g., network, database) and business services to enable multidimensional reporting.
Select whether to use symptom-based or root cause-based categorization during initial logging, balancing speed of entry against long-term data accuracy.
Configure CI impact mapping in the CMS so that recurring incidents on shared infrastructure trigger proactive problem identification.
Integrate application performance monitoring (APM) data into problem intake to detect performance degradation patterns not captured in incident tickets.
Define thresholds for identifying major problems based on financial impact, number of customers affected, or regulatory exposure, with input from business relationship managers.
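
One way to picture the correlation rule from this module is a per-CI incident count that ignores incidents opened during scheduled maintenance. This is a sketch under assumed data shapes: the detect_spikes function, the tuple layout, and the threshold parameter are illustrative and do not correspond to any specific event management tool.

```python
from collections import Counter
from datetime import datetime


def detect_spikes(incidents, maintenance_windows, threshold):
    """Return CIs whose incident count outside maintenance windows meets the threshold.

    incidents: list of (ci_id, opened_at) tuples
    maintenance_windows: dict mapping ci_id to a list of (start, end) datetimes
    """
    counts = Counter()
    for ci_id, opened_at in incidents:
        windows = maintenance_windows.get(ci_id, [])
        in_maintenance = any(start <= opened_at <= end for start, end in windows)
        if not in_maintenance:
            counts[ci_id] += 1
    return sorted(ci for ci, n in counts.items() if n >= threshold)


incidents = [
    ("db-01", datetime(2024, 1, 1, 10, 0)),
    ("db-01", datetime(2024, 1, 1, 10, 5)),
    ("db-01", datetime(2024, 1, 1, 10, 10)),
    ("web-01", datetime(2024, 1, 1, 2, 0)),
    ("web-01", datetime(2024, 1, 1, 2, 10)),
    ("web-01", datetime(2024, 1, 1, 2, 20)),
]
windows = {"web-01": [(datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 3, 0))]}
print(detect_spikes(incidents, windows, threshold=3))  # ['db-01']
```

Here web-01 is suppressed because all of its incidents fall inside its maintenance window, which is the false-positive case the tuning is meant to address.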

Module 3: Root Cause Analysis Techniques and Application

Choose between Ishikawa diagrams and 5 Whys for root cause analysis (RCA) facilitation based on problem complexity and stakeholder technical literacy.
Conduct timeline analysis for outages by synchronizing logs from distributed systems, requiring coordination with infrastructure and security teams for access.
Decide whether to involve vendor support teams in RCA workshops, considering contractual obligations and data confidentiality agreements.
Apply Pareto analysis to historical incident data to focus RCA efforts on the small share of problems (roughly 20%) that typically causes most service disruptions (roughly 80%).
Document RCA findings in a standardized format that links evidence to conclusions, enabling auditability and knowledge reuse.
Balance depth of analysis against business urgency when determining how many contributing factors to investigate in a time-constrained environment.
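
The Pareto analysis mentioned in this module can be illustrated with a short calculation over historical incident causes. This is a sketch only: the pareto_causes function, the cause labels, and the 0.8 cutoff are assumptions chosen for illustration.

```python
from collections import Counter


def pareto_causes(incident_causes, cutoff=0.8):
    """Return the most frequent causes that together cover `cutoff` of all incidents."""
    counts = Counter(incident_causes)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, n in counts.most_common():
        # Stop once the causes already selected reach the cutoff share.
        if covered / total >= cutoff:
            break
        selected.append(cause)
        covered += n
    return selected


causes = ["disk"] * 50 + ["dns"] * 35 + ["cert"] * 10 + ["other"] * 5
print(pareto_causes(causes))  # ['disk', 'dns']
```

In this made-up data set, two of the four causes account for 85% of incidents, so RCA effort would be directed at those two first.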

Module 4: Managing the Known Error Database (KEDB)

Define ownership for KEDB updates, specifying whether problem managers, technical leads, or configuration analysts are responsible for entry accuracy.
Establish synchronization protocols between the KEDB and knowledge management system to ensure workarounds are available to service desk personnel.
Implement version control for KEDB entries when multiple proposed fixes exist for the same known error, preventing conflicting guidance.
Enforce review cycles for KEDB entries tied to unresolved problems, triggering revalidation after 30, 60, and 90 days of inactivity.
Integrate KEDB search functionality into incident ticketing tools to prompt service desk agents with known workarounds during triage.
Restrict KEDB access based on role, ensuring that only authorized personnel can modify or retire entries linked to active changes.
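
The 30, 60, and 90 day review cycle described in this module could be checked with a small date calculation. This is a hedged sketch: the milestones_due function and the idea of tracking completed milestones as a set are assumptions, not features of any particular KEDB tool.

```python
from datetime import date

REVIEW_MILESTONES_DAYS = (30, 60, 90)  # review cycle from the module text


def milestones_due(last_updated: date, today: date, completed: set) -> list:
    """Return review milestones reached through inactivity and not yet completed.

    completed: milestone day counts already revalidated since last_updated
    """
    inactive_days = (today - last_updated).days
    return [
        m for m in REVIEW_MILESTONES_DAYS
        if inactive_days >= m and m not in completed
    ]


# An entry untouched for 64 days whose 30-day review is already done.
print(milestones_due(date(2024, 1, 1), date(2024, 3, 5), {30}))  # [60]
```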

Module 5: Problem Resolution and Change Implementation

Require problem records to reference associated change requests, ensuring traceability from diagnosis to deployment of fixes.
Assess whether a permanent fix requires standard, normal, or emergency change handling based on risk, impact, and urgency criteria.
Coordinate regression testing plans with application owners when deploying fixes that affect shared services or integrations.
Document rollback procedures for problem resolutions that introduce new code or configuration changes to production environments.
Verify that change success metrics (e.g., incident reduction, improvement in mean time to resolve) are captured post-implementation to validate resolution effectiveness.
Negotiate change freeze exceptions for critical problem resolutions during scheduled maintenance windows, requiring CAB escalation.
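
The choice between standard, normal, and emergency change handling can be sketched as a small decision function. The rating scale ("low", "medium", "high"), the preapproved flag, and the specific rules below are all assumptions made for illustration; an actual change policy would define its own criteria.

```python
def classify_change(risk: str, impact: str, urgency: str, preapproved: bool = False) -> str:
    """Map risk, impact, and urgency ratings ("low", "medium", "high") to a change type."""
    # Assumption: a high-impact fix that cannot wait for the normal cycle is an emergency.
    if urgency == "high" and impact == "high":
        return "emergency"
    # Assumption: only pre-approved, low-risk, low-impact changes qualify as standard.
    if preapproved and risk == "low" and impact == "low":
        return "standard"
    # Everything else goes through full assessment and authorization.
    return "normal"
```

Defaulting to the normal path keeps any unclassified fix inside full risk assessment rather than letting it bypass review.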

Module 6: Metrics, Reporting, and Performance Monitoring

Track mean time to resolve (MTTR) for problems segmented by priority, identifying bottlenecks in analysis or change approval stages.
Measure the percentage of incidents resolved using known errors to assess KEDB utilization and knowledge transfer efficacy.
Report on problem backlog aging to identify stalled investigations requiring management intervention or resource reallocation.
Calculate cost of downtime per unresolved major problem to justify investment in root cause analysis resources.
Compare proactive versus reactive problem identification rates to evaluate the maturity of monitoring and trend analysis capabilities.
Align problem management KPIs with continual service improvement (CSI) goals, ensuring metrics drive actionable insights rather than compliance reporting.
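
Two of the metrics in this module reduce to simple arithmetic. The sketch below assumes resolved problems are available as (priority, hours) pairs and that incident counts are already aggregated; the function names and sample figures are illustrative.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_priority(problems):
    """Mean time to resolve per priority.

    problems: list of (priority, hours_to_resolve) for resolved problems
    """
    buckets = defaultdict(list)
    for priority, hours in problems:
        buckets[priority].append(hours)
    return {priority: mean(hours) for priority, hours in buckets.items()}


def kedb_utilization(incidents_resolved, resolved_with_known_error):
    """Percentage of resolved incidents that were closed using a known error entry."""
    if incidents_resolved == 0:
        return 0.0
    return 100.0 * resolved_with_known_error / incidents_resolved


print(mttr_by_priority([("P1", 10), ("P1", 20), ("P2", 40)]))  # {'P1': 15, 'P2': 40}
print(kedb_utilization(200, 50))  # 25.0
```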

Module 7: Integration with Continual Service Improvement (CSI)

Feed problem trends into CSI registers to prioritize improvement initiatives based on recurring failure patterns.
Use problem data to validate baseline measurements for service availability and reliability during step two of the CSI approach ("Where are we now?").
Link resolved problems to process improvements in change, release, or capacity management to close feedback loops.
Conduct quarterly service reviews using problem reports to demonstrate progress on reducing chronic incidents.
Update service design packages with lessons learned from major problem investigations to prevent recurrence in new implementations.
Embed problem-derived improvement actions into the seven-step improvement process, ensuring ownership and timeline accountability.

Module 8: Governance, Roles, and Cross-Functional Coordination

Define escalation paths for problems that exceed resolution time targets, specifying intervention points for service owners and technical directors.
Grant problem managers cross-domain authority to coordinate RCA across siloed technical teams that do not report to them directly.
Establish service-level requirements for problem management in operational level agreements (OLAs) with support groups.
Conduct problem review meetings with application, infrastructure, and security stakeholders to validate findings and assign action items.
Enforce audit trails for problem record modifications to support compliance requirements in regulated environments.
Balance centralized oversight with decentralized execution by defining which decisions require problem management board approval versus team-level discretion.