This curriculum covers the technical and organisational scope of a multi-workshop performance engineering engagement, addressing the instrumentation, correlation, and diagnostic challenges that arise during root-cause analysis of distributed systems in large-scale production environments.
Module 1: Defining Performance Baselines and Thresholds
- Selecting appropriate KPIs for system responsiveness, throughput, and error rates based on business-critical transaction types.
- Establishing dynamic thresholds using statistical process control instead of static limits to reduce false alerts during normal load fluctuations.
- Deciding whether to baseline at the infrastructure, application, or business transaction level based on observability scope and tooling constraints.
- Integrating historical performance data from multiple environments to account for seasonal usage patterns before setting thresholds.
- Documenting and versioning baseline definitions to support auditability and reproducibility during incident reviews.
- Resolving conflicts between development, operations, and business stakeholders on what constitutes acceptable performance under load.
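The dynamic-threshold idea above can be sketched with a rolling control-limit check. This is a minimal illustration, not a production alerting rule: the `RollingBaseline` class, its window size, and the 3-sigma limits are all assumptions chosen for the example.

```python
import statistics
from collections import deque

def spc_limits(samples, sigma=3.0):
    """Return (lower, upper) control limits from a window of latency samples."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean - sigma * stdev, mean + sigma * stdev

class RollingBaseline:
    """Flags observations outside control limits computed from a rolling window,
    so thresholds follow normal load fluctuations instead of a static limit."""

    def __init__(self, window=60, sigma=3.0):
        self.window = deque(maxlen=window)
        self.sigma = sigma

    def observe(self, value):
        # Only alert once the window is full, so warm-up noise is ignored.
        if len(self.window) == self.window.maxlen:
            lo, hi = spc_limits(self.window, self.sigma)
            breach = not (lo <= value <= hi)
        else:
            breach = False
        self.window.append(value)
        return breach
```

In practice the window length would be tuned to the seasonality of the workload (per the historical-data bullet above), and the breached point would be excluded from the baseline rather than appended.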
Module 2: Instrumentation Strategy and Data Collection
- Choosing among agent-based, API-injected, and network-tap monitoring based on application architecture and security requirements.
- Configuring sampling rates for distributed tracing to balance data fidelity with storage costs and performance overhead.
- Mapping custom business transaction identifiers across microservices to maintain end-to-end traceability in polyglot environments.
- Implementing secure credential handling for monitoring agents accessing production databases and message queues.
- Validating timestamp synchronization across distributed systems to ensure accurate event correlation.
- Negotiating data retention policies with legal and compliance teams for performance telemetry containing PII.
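The sampling-rate bullet can be made concrete with a deterministic head-based sampler: hashing the trace ID means every service in a call chain reaches the same keep/drop decision without coordination, which is the same idea behind ratio-based samplers in tracing SDKs such as OpenTelemetry. This is a hand-rolled sketch, not any SDK's actual implementation.

```python
import hashlib

def should_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head sampling: the same trace ID always yields the
    same decision, so partial traces are avoided across services."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to [0, 1) and compare to the ratio.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio
```

Lowering `ratio` trades data fidelity for storage cost and overhead, which is exactly the balance the bullet describes.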
Module 3: Correlation of Multi-layer Telemetry
- Aligning log timestamps with APM traces and infrastructure metrics using a centralized time source and log ingestion pipeline.
- Building correlation IDs that propagate across service boundaries, message brokers, and batch processes for unified diagnostics.
- Using dependency mapping tools to identify indirect service relationships that contribute to latency but are not directly invoked.
- Filtering noise in correlated datasets by excluding health check traffic and synthetic monitoring probes from analysis.
- Resolving discrepancies between application-reported durations and network-level latency measurements during triage.
- Automating correlation rule updates when new services or integration points are deployed into production.
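One way to sketch the correlation-ID propagation described above is with a context variable that follows a request through async and threaded code, reusing an inbound ID or minting a new one at the edge. The header name `X-Correlation-ID` and the helper names here are illustrative assumptions, not a standard.

```python
import contextvars
import uuid

# Carries the correlation ID along the current request's call path.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)

def ensure_correlation_id(incoming_header=None):
    """Reuse an inbound ID (e.g. from an X-Correlation-ID header) if present,
    otherwise reuse the one already in context, otherwise mint a new one."""
    cid = incoming_header or _correlation_id.get() or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid

def outbound_headers():
    """Headers to attach to downstream HTTP calls or broker messages so the
    same ID spans service boundaries, message brokers, and batch jobs."""
    return {"X-Correlation-ID": ensure_correlation_id()}
```

For batch processes the same ID would be carried in message metadata rather than HTTP headers, so logs, traces, and queue consumers can all be joined on one key.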
Module 4: Diagnosing Resource Contention and Bottlenecks
- Differentiating between CPU saturation caused by application logic versus garbage collection cycles in JVM-based systems.
- Interpreting memory pressure indicators across containers and host systems to isolate noisy neighbor issues in shared clusters.
- Assessing disk I/O latency at the hypervisor, storage array, and filesystem layers to pinpoint storage subsystem bottlenecks.
- Identifying thread pool exhaustion in application servers by correlating thread dumps with request queue metrics.
- Measuring network round-trip times across zones and regions to evaluate impact on distributed transaction performance.
- Validating whether connection pooling configurations match actual concurrency demands under peak load conditions.
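The connection-pool bullet above has a simple quantitative backbone: by Little's Law, the mean number of in-flight requests is arrival rate times mean service time, so a pool smaller than that queues even at steady state. The headroom factor below is an assumed burst allowance, not a derived value.

```python
import math

def required_pool_size(arrival_rate_rps, mean_service_time_s, headroom=1.5):
    """Estimate connections needed via Little's Law (L = lambda * W),
    with multiplicative headroom for bursts above the mean."""
    concurrent = arrival_rate_rps * mean_service_time_s
    return math.ceil(concurrent * headroom)

def pool_is_adequate(configured, arrival_rate_rps, mean_service_time_s):
    """Compare a configured pool size against the estimated peak demand."""
    return configured >= required_pool_size(arrival_rate_rps, mean_service_time_s)
```

For example, 200 requests/s at a 50 ms mean service time implies about 10 concurrent connections, so a pool of 15 (with 1.5x headroom) is the estimated floor under those peak-load figures.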
Module 5: Root-Cause Validation and Hypothesis Testing
- Designing controlled production experiments using feature flags to isolate the impact of specific code paths on performance.
- Executing canary rollbacks to verify whether a recent deployment correlates with observed degradation patterns.
- Using statistical hypothesis testing to determine whether performance changes are significant or within normal variance.
- Reproducing production bottlenecks in staging environments using production-like data volumes and access patterns.
- Comparing pre- and post-incident profiles using flame graphs to visually identify new hot code paths.
- Documenting assumptions and evidence for each eliminated hypothesis to prevent recurrence of diagnostic errors.
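The hypothesis-testing bullet can be illustrated with a two-sample permutation test on latency samples, which avoids distributional assumptions that production latency data usually violates. This is one possible test among several (a Mann-Whitney U test is a common alternative), sketched in pure Python.

```python
import random
import statistics

def permutation_test(before, after, n_permutations=5000, seed=0):
    """Two-sample permutation test on the difference of mean latencies.
    Returns a p-value for the observed difference under the null hypothesis
    that both samples come from the same distribution."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(after) - statistics.fmean(before))
    pooled = list(before) + list(after)
    n = len(before)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[n:]) - statistics.fmean(pooled[:n]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)
```

A small p-value suggests the post-change latencies differ beyond normal variance; a large one means the "degradation" may just be noise, which is the distinction the bullet calls for before escalating.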
Module 6: Change Impact Analysis and Configuration Drift
- Linking performance incidents to configuration management databases to assess recent changes in middleware settings.
- Reviewing auto-scaling policy adjustments that may have triggered resource oscillation under variable load.
- Investigating DNS or service discovery changes that result in suboptimal routing and increased latency.
- Validating that database index rebuilds or statistics updates were completed before attributing slowness to query plans.
- Tracking third-party API version upgrades that introduce unexpected payload size or rate limiting behavior.
- Reconciling deployment timing with performance degradation onset using immutable artifact identifiers and CI/CD logs.
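Reconciling deployment timing with degradation onset, as in the last bullet, often reduces to a windowed join between CI/CD records and the incident timeline. A minimal sketch, assuming deploys are `(artifact_id, completed_at)` pairs pulled from pipeline logs:

```python
from datetime import datetime, timedelta

def deploys_near_onset(deploys, onset, window=timedelta(hours=2)):
    """Return (artifact_id, timestamp) pairs for deployments that completed
    within `window` before the degradation onset -- the candidate set for
    change-impact review or canary rollback."""
    return [(artifact, ts) for artifact, ts in deploys
            if timedelta(0) <= onset - ts <= window]
```

Because the artifact identifiers are immutable, a hit in this window can be tied unambiguously to a specific build rather than to a branch or tag that may have moved.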
Module 7: Post-Incident Review and Feedback Loops
- Extracting actionable metrics from incident timelines to measure detection, diagnosis, and resolution durations.
- Updating monitoring dashboards and alerting rules based on gaps identified during recent root-cause investigations.
- Integrating performance anti-patterns discovered in incidents into pre-deployment static analysis pipelines.
- Adjusting synthetic transaction scripts to reflect real user journeys that previously lacked coverage.
- Standardizing runbook updates to include performance-specific triage steps derived from recent incidents.
- Facilitating cross-team workshops to align SRE, development, and database administration on recurring performance failure modes.
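The first bullet above, extracting detection, diagnosis, and resolution durations from an incident timeline, can be sketched as a small transformation over timestamped milestones. The milestone names (`start`, `detected`, `diagnosed`, `resolved`) are assumed labels for this example, not a standard schema.

```python
from datetime import datetime

def incident_durations(timeline):
    """Compute detection, diagnosis, and resolution durations in minutes
    from a timeline dict keyed by milestone name."""
    def minutes(a, b):
        return (timeline[b] - timeline[a]).total_seconds() / 60
    return {
        "time_to_detect": minutes("start", "detected"),
        "time_to_diagnose": minutes("detected", "diagnosed"),
        "time_to_resolve": minutes("diagnosed", "resolved"),
    }
```

Aggregated over many incidents, these three durations show whether monitoring gaps, diagnostic tooling, or remediation process is the dominant cost, which is what the feedback loop in this module feeds on.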
Module 8: Scaling Analysis Across Complex Environments
- Partitioning analysis by tenant or business unit in multi-tenant systems to isolate localized performance issues.
- Aggregating performance signals across geographically distributed instances while preserving regional specificity.
- Managing tool sprawl by consolidating findings from APM, infrastructure monitoring, and custom logging systems into a unified view.
- Implementing role-based data access controls in analysis platforms to comply with least-privilege security policies.
- Optimizing query performance on large telemetry datasets using indexing strategies and pre-aggregated rollups.
- Automating anomaly detection model retraining to adapt to architectural changes such as service decomposition or data sharding.
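The pre-aggregated-rollup bullet can be sketched as bucketing raw telemetry points into fixed windows and storing only summary statistics, so dashboard queries hit small rollups instead of raw data. The bucket size and the choice of count/max/p95 are illustrative assumptions.

```python
from collections import defaultdict

def rollup(points, bucket_seconds=60):
    """Pre-aggregate (epoch_seconds, latency_ms) points into fixed-size
    buckets, keeping count, max, and a nearest-rank p95 per bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    out = {}
    for start, values in buckets.items():
        values.sort()
        # Nearest-rank p95: index ceil(0.95 * n) - 1 into the sorted values.
        idx = max(0, -(-len(values) * 95 // 100) - 1)
        out[start] = {"count": len(values), "max": values[-1], "p95": values[idx]}
    return out
```

One caveat worth teaching alongside this: percentiles do not compose, so rollups must be built at the finest granularity ever queried (or keep histograms) rather than averaging p95s across buckets.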