Configuring Fault Correlation
This tutorial shows how to configure fault correlation to automatically identify root causes and reduce fault noise in complex systems.
Overview
When multiple faults occur in rapid succession, they often share a common root cause. Fault correlation helps by:
Hierarchical mode: Identifying root cause → symptom relationships. Symptoms are muted while the root cause is displayed.
Auto-cluster mode: Grouping similar faults that occur within a time window into a single cluster.
This is similar to:
AUTOSAR DEM event combination - grouping related diagnostic events
AIOps alert correlation - reducing alert storms in monitoring systems
Benefits:
Reduced fault noise for operators
Faster root cause identification
Cleaner fault lists in dashboards
How It Works
Hierarchical Mode Flow:
Auto-Cluster Mode Flow:
Quick Start
Create a correlation configuration file:
# correlation.yaml
correlation:
  enabled: true
  default_window_ms: 500
  patterns:
    motor_errors:
      codes: ["MOTOR_*"]
    sensor_errors:
      codes: ["SENSOR_*"]
  rules:
    - id: estop_cascade
      name: "E-Stop Cascade"
      mode: hierarchical
      root_cause:
        codes: ["ESTOP_001"]
      symptoms:
        - pattern: motor_errors
      window_ms: 2000
      mute_symptoms: true
      auto_clear_with_root: true
Start the fault manager with correlation enabled:
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p correlation.config_file:=/path/to/correlation.yaml
Query faults with correlation data:
curl "http://localhost:8080/api/v1/faults?include_muted=true"
Configuration Reference
Top-Level Settings
| Parameter | Default | Description |
|---|---|---|
| enabled | | Enable/disable fault correlation |
| default_window_ms | | Default time window for correlation rules (ms) |
Patterns Section
Patterns define reusable groups of fault codes with wildcard support:
patterns:
  motor_errors:
    codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*"]
  drive_faults:
    codes: ["DRIVE_*", "INVERTER_*"]
  sensor_errors:
    codes: ["SENSOR_*"]
Wildcard syntax:
* matches any sequence of characters
MOTOR_* matches MOTOR_001, MOTOR_OVERHEAT, etc.
Exact codes (without *) use fast string comparison
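The wildcard rules above can be illustrated with a small Python sketch. This is not the fault manager's implementation; the helper names `compile_code_pattern` and `matches_any` are hypothetical, and the sketch only assumes what the list states: `*` matches any run of characters, and codes without `*` are compared as plain strings.

```python
import re


def compile_code_pattern(pattern):
    """Return a matcher function for one fault-code pattern (hypothetical helper)."""
    if "*" not in pattern:
        # Exact codes: plain string comparison, no regex involved.
        return lambda code: code == pattern
    # Escape every literal part and join the parts with '.*' so that
    # '*' is the only character with special meaning.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    return lambda code: regex.fullmatch(code) is not None


def matches_any(code, patterns):
    """True if the code matches at least one pattern in the list."""
    return any(compile_code_pattern(p)(code) for p in patterns)


if __name__ == "__main__":
    print(matches_any("MOTOR_OVERHEAT", ["MOTOR_*"]))
    print(matches_any("SENSOR_001", ["MOTOR_*"]))
```

A real matcher would likely compile each pattern once at load time instead of on every lookup; the sketch recompiles for brevity.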
Hierarchical Mode
Hierarchical mode identifies root cause → symptom relationships.
rules:
  - id: estop_cascade
    name: "E-Stop Cascade"
    mode: hierarchical
    root_cause:
      codes: ["ESTOP_001", "ESTOP_002"]
    symptoms:
      - pattern: motor_errors
      - pattern: drive_faults
    window_ms: 2000
    mute_symptoms: true
    auto_clear_with_root: true
| Parameter | Default | Description |
|---|---|---|
| root_cause.codes | (required) | Fault codes that trigger this rule |
| symptoms | (required) | List of pattern references for symptom faults |
| window_ms | | Time window to look for symptoms after root cause |
| mute_symptoms | | Hide symptom faults from normal queries |
| auto_clear_with_root | | Auto-clear symptoms when root cause is cleared |
How it works:
When ESTOP_001 is reported, it’s marked as a root cause
Any MOTOR_* or DRIVE_* faults within 2000ms are marked as symptoms
Symptoms are muted (not shown in default fault list)
Clearing ESTOP_001 auto-clears all its symptoms
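The four steps above can be sketched in Python as a toy model. It is an illustration under stated assumptions, not the fault manager's code: the class names `HierarchicalRule` and `HierarchicalCorrelator` are hypothetical, symptom patterns are simplified to code prefixes, and only symptoms reported after the root cause are correlated.

```python
from dataclasses import dataclass


@dataclass
class HierarchicalRule:
    # Hypothetical in-memory form of one hierarchical rule.
    rule_id: str
    root_codes: list
    symptom_prefixes: list  # simplification: "MOTOR_" stands in for "MOTOR_*"
    window_ms: int
    mute_symptoms: bool = True
    auto_clear_with_root: bool = True


class HierarchicalCorrelator:
    def __init__(self, rule):
        self.rule = rule
        self.active_roots = {}  # root code -> report time in ms
        self.symptoms = {}      # symptom code -> root code

    def report(self, code, timestamp_ms):
        """Classify a fault as 'root_cause', 'muted', 'symptom' or 'normal'."""
        if code in self.rule.root_codes:
            self.active_roots[code] = timestamp_ms
            return "root_cause"
        if any(code.startswith(p) for p in self.rule.symptom_prefixes):
            for root, t0 in self.active_roots.items():
                # Assumption: the window opens when the root cause is reported.
                if 0 <= timestamp_ms - t0 <= self.rule.window_ms:
                    self.symptoms[code] = root
                    return "muted" if self.rule.mute_symptoms else "symptom"
        return "normal"

    def clear(self, root_code):
        """Clear a root cause and return the symptom codes cleared with it."""
        self.active_roots.pop(root_code, None)
        if not self.rule.auto_clear_with_root:
            return []
        cleared = sorted(c for c, r in self.symptoms.items() if r == root_code)
        for c in cleared:
            del self.symptoms[c]
        return cleared


if __name__ == "__main__":
    rule = HierarchicalRule("estop_cascade", ["ESTOP_001"], ["MOTOR_", "DRIVE_"], 2000)
    correlator = HierarchicalCorrelator(rule)
    print(correlator.report("ESTOP_001", 0))
    print(correlator.report("MOTOR_COMM_001", 150))
    print(correlator.clear("ESTOP_001"))
```

In this model a symptom that arrives after the window has closed is treated as an ordinary fault and stays visible.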
Auto-Cluster Mode
Auto-cluster mode groups similar faults into clusters.
rules:
  - id: sensor_storm
    name: "Sensor Storm"
    mode: auto_cluster
    match:
      - pattern: sensor_errors
    min_count: 3
    window_ms: 2000
    show_as_single: true
    representative: highest_severity
| Parameter | Default | Description |
|---|---|---|
| match | (required) | List of pattern references for faults to cluster |
| min_count | | Minimum faults needed to form a cluster |
| window_ms | | Time window for clustering faults |
| show_as_single | | Show cluster as single fault (mute non-representatives) |
| representative | | How to select the cluster representative (for example, highest_severity) |
How it works:
First SENSOR_* fault starts a potential cluster
Additional SENSOR_* faults within 2000ms join the cluster
When 3+ faults accumulate, cluster becomes active
Only the representative fault is shown; others are muted
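The clustering steps above can be sketched as a toy tracker in Python. Several details here are assumptions made for illustration: the `ClusterTracker` class is hypothetical, the window is anchored at the first fault of the cluster, the pattern is simplified to a code prefix, and the severity ordering in `SEVERITY_RANK` is a guess (only `CRITICAL` appears in this tutorial).

```python
# Assumed severity ordering; only CRITICAL is named in the tutorial.
SEVERITY_RANK = {"INFO": 0, "WARN": 1, "ERROR": 2, "CRITICAL": 3}


class ClusterTracker:
    """Toy model of one auto_cluster rule with representative: highest_severity."""

    def __init__(self, prefix, min_count, window_ms):
        self.prefix = prefix          # simplification: "SENSOR_" for "SENSOR_*"
        self.min_count = min_count
        self.window_ms = window_ms
        self.start_ms = None          # time of the first fault in the cluster
        self.members = []             # list of (code, severity)

    def report(self, code, severity, timestamp_ms):
        """Add a fault and return True if the cluster is active afterwards."""
        if not code.startswith(self.prefix):
            return self.is_active()
        if self.start_ms is None or timestamp_ms - self.start_ms > self.window_ms:
            # Outside the window: start a new potential cluster.
            self.start_ms = timestamp_ms
            self.members = []
        self.members.append((code, severity))
        return self.is_active()

    def is_active(self):
        return len(self.members) >= self.min_count

    def representative(self):
        """Code of the highest-severity member, or None while inactive."""
        if not self.is_active():
            return None
        return max(self.members, key=lambda m: SEVERITY_RANK.get(m[1], -1))[0]

    def muted(self):
        """Members hidden when the cluster is shown as a single fault."""
        rep = self.representative()
        return [c for c, _ in self.members if c != rep] if rep else []


if __name__ == "__main__":
    tracker = ClusterTracker("SENSOR_", min_count=3, window_ms=2000)
    tracker.report("SENSOR_001", "WARN", 0)
    tracker.report("SENSOR_002", "ERROR", 500)
    tracker.report("SENSOR_CRITICAL_001", "CRITICAL", 1000)
    print(tracker.representative(), tracker.muted())
```

Until the third matching fault arrives the tracker reports no representative and mutes nothing, which mirrors the note that a cluster only becomes active at `min_count`.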
Querying Correlation Data
Basic query (includes counts):
curl http://localhost:8080/api/v1/faults
Response always includes:
{
  "faults": [...], // fault objects
  "count": 5,
  "muted_count": 12,
  "cluster_count": 2
}
With muted fault details:
curl "http://localhost:8080/api/v1/faults?include_muted=true"
{
  "faults": [...], // fault objects
  "muted_count": 12,
  "muted_faults": [
    {
      "fault_code": "MOTOR_COMM_001",
      "root_cause_code": "ESTOP_001",
      "rule_id": "estop_cascade",
      "delay_ms": 150
    }
  ]
}
With cluster details:
curl "http://localhost:8080/api/v1/faults?include_clusters=true"
{
  "faults": [...], // fault objects
  "cluster_count": 2,
  "clusters": [
    {
      "cluster_id": "sensor_storm_1",
      "rule_id": "sensor_storm",
      "rule_name": "Sensor Storm",
      "representative_code": "SENSOR_CRITICAL_001",
      "representative_severity": "CRITICAL",
      "fault_codes": ["SENSOR_001", "SENSOR_002", "SENSOR_CRITICAL_001"],
      "first_at": "2026-01-19T10:00:00Z",
      "last_at": "2026-01-19T10:00:01Z"
    }
  ]
}
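For scripted use, the same queries can be issued from Python's standard library. The sketch below is a hedged example: the helper names `build_url`, `fetch_faults` and `summarize` are hypothetical, the field names are taken from the example responses above, and combining `include_muted` with `include_clusters` in one request is an assumption that this tutorial does not confirm.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080/api/v1/faults"


def build_url(include_muted=False, include_clusters=False):
    """Build the fault-list URL with the optional correlation flags."""
    params = {}
    if include_muted:
        params["include_muted"] = "true"
    if include_clusters:
        params["include_clusters"] = "true"
    return BASE_URL + ("?" + urlencode(params) if params else "")


def fetch_faults(include_muted=False, include_clusters=False):
    """Fetch and decode the fault list (needs a running gateway)."""
    with urlopen(build_url(include_muted, include_clusters), timeout=5) as resp:
        return json.load(resp)


def summarize(response):
    """Turn correlation details of a decoded response into readable lines."""
    lines = []
    for m in response.get("muted_faults", []):
        lines.append(
            f"{m['fault_code']} muted by {m['root_cause_code']} "
            f"({m['rule_id']}, +{m['delay_ms']} ms)"
        )
    for c in response.get("clusters", []):
        lines.append(
            f"{c['cluster_id']}: {len(c['fault_codes'])} faults, "
            f"representative {c['representative_code']}"
        )
    return lines


if __name__ == "__main__":
    for line in summarize(fetch_faults(include_muted=True)):
        print(line)
```

`summarize` works on any decoded response, so it can be tried against saved JSON without a running server.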
Auto-clear on root cause resolution:
curl -X DELETE http://localhost:8080/api/v1/faults/ESTOP_001
{
  "success": true,
  "fault_code": "ESTOP_001",
  "auto_cleared_codes": ["MOTOR_COMM_001", "MOTOR_TIMEOUT_002", "DRIVE_001"]
}
Example: Complete Configuration
# /etc/ros2_medkit/correlation.yaml
correlation:
  enabled: true
  default_window_ms: 500
  patterns:
    motor_errors:
      codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*", "MOTOR_OVERHEAT_*"]
    drive_faults:
      codes: ["DRIVE_*", "INVERTER_*"]
    sensor_errors:
      codes: ["SENSOR_*"]
    battery_warnings:
      codes: ["BATTERY_LOW", "BATTERY_CRITICAL"]
  rules:
    # E-Stop causes motor and drive shutdowns
    - id: estop_cascade
      name: "E-Stop Cascade"
      mode: hierarchical
      root_cause:
        codes: ["ESTOP_001", "ESTOP_002"]
      symptoms:
        - pattern: motor_errors
        - pattern: drive_faults
      window_ms: 2000
      mute_symptoms: true
      auto_clear_with_root: true
    # Battery critical causes low battery warnings
    - id: battery_cascade
      name: "Battery Cascade"
      mode: hierarchical
      root_cause:
        codes: ["BATTERY_CRITICAL"]
      symptoms:
        - pattern: battery_warnings
      window_ms: 1000
    # Group sensor storms
    - id: sensor_storm
      name: "Sensor Storm"
      mode: auto_cluster
      match:
        - pattern: sensor_errors
      min_count: 3
      window_ms: 2000
      show_as_single: true
      representative: highest_severity
Troubleshooting
Symptoms not being muted
Check that mute_symptoms: true is set
Verify the symptom fault code matches a pattern in symptoms
Ensure the symptom occurs within window_ms of the root cause
Check fault manager logs for correlation matches
Cluster not forming
Verify min_count faults have occurred within window_ms
Check that fault codes match patterns in match
Clusters only become “active” after reaching min_count
Root cause not detected
Verify the fault code exactly matches one in root_cause.codes
Wildcards in root_cause.codes are supported
Configuration validation
The fault manager validates configuration on startup. Check logs for:
[WARN] Rule 'my_rule' references unknown pattern: missing_pattern
[ERROR] Hierarchical rule 'my_rule' has no root_cause codes
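The two checks shown in these log lines can be reproduced offline before deploying a configuration. The following Python sketch is an assumption-laden illustration, not the fault manager's validator: `validate_correlation` is a hypothetical helper, it expects the YAML already loaded into a dict, and it assumes `patterns` and `rules` sit under the `correlation` key.

```python
def validate_correlation(config):
    """Return (level, message) tuples for problems found in a loaded config."""
    corr = config.get("correlation", {})
    patterns = corr.get("patterns", {})
    issues = []
    for rule in corr.get("rules", []):
        rule_id = rule.get("id", "<unnamed>")
        mode = rule.get("mode")
        # Hierarchical rules reference patterns under 'symptoms',
        # auto_cluster rules under 'match'.
        refs = rule.get("symptoms", []) if mode == "hierarchical" else rule.get("match", [])
        for ref in refs:
            name = ref.get("pattern")
            if name not in patterns:
                issues.append(
                    ("WARN", f"Rule '{rule_id}' references unknown pattern: {name}")
                )
        if mode == "hierarchical" and not rule.get("root_cause", {}).get("codes"):
            issues.append(
                ("ERROR", f"Hierarchical rule '{rule_id}' has no root_cause codes")
            )
    return issues


if __name__ == "__main__":
    config = {
        "correlation": {
            "patterns": {"motor_errors": {"codes": ["MOTOR_*"]}},
            "rules": [
                {
                    "id": "my_rule",
                    "mode": "hierarchical",
                    "symptoms": [{"pattern": "missing_pattern"}],
                }
            ],
        }
    }
    for level, message in validate_correlation(config):
        print(f"[{level}] {message}")
```

The message texts mirror the log lines above so the output is easy to compare with what the fault manager prints at startup.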
See Also
Configuring Snapshot Capture - Capture topic data when faults are confirmed
Faults - Fault API requirements
FaultManager README - Detailed configuration reference