Configuring Fault Correlation ============================= This tutorial shows how to configure fault correlation to automatically identify root causes and reduce fault noise in complex systems. .. contents:: Table of Contents :local: :depth: 2 Overview -------- When multiple faults occur in rapid succession, they often share a common root cause. Fault correlation helps by: - **Hierarchical mode**: Identifying root cause → symptom relationships. Symptoms are muted while the root cause is displayed. - **Auto-cluster mode**: Grouping similar faults that occur within a time window into a single cluster. This is similar to: - **AUTOSAR DEM event combination** - grouping related diagnostic events - **AIOps alert correlation** - reducing alert storms in monitoring systems Benefits: - Reduced fault noise for operators - Faster root cause identification - Cleaner fault lists in dashboards How It Works ~~~~~~~~~~~~ **Hierarchical Mode Flow:** .. uml:: @startuml skinparam backgroundColor #FEFEFE skinparam sequenceMessageAlign center participant "Fault Source" as src participant "FaultManager" as fm participant "CorrelationEngine" as ce database "Fault Storage" as db == Root Cause Detected == src -> fm: ReportFault(ESTOP_001, CRITICAL) fm -> ce: process_fault("ESTOP_001") ce -> ce: Register as root cause\n(rule: estop_cascade, window: 2000ms) ce --> fm: is_root_cause=true fm -> db: Store fault fm --> src: accepted == Symptoms Arrive (within window) == src -> fm: ReportFault(MOTOR_COMM_FL, ERROR) fm -> ce: process_fault("MOTOR_COMM_FL") ce -> ce: Matches MOTOR_* pattern\nWithin 2000ms window ce --> fm: should_mute=true, root_cause="ESTOP_001" fm -> db: Store fault (muted) note right: Fault stored but\nnot published to SSE src -> fm: ReportFault(DRIVE_FAULT, ERROR) fm -> ce: process_fault("DRIVE_FAULT") ce --> fm: should_mute=true fm -> db: Store fault (muted) == Query Faults == src -> fm: ListFaults() fm --> src: [ESTOP_001], muted_count=2 == Clear Root Cause == src -> fm: ClearFault(ESTOP_001) fm -> ce: process_clear("ESTOP_001") ce --> fm: auto_cleared=[MOTOR_COMM_FL, DRIVE_FAULT] fm -> db: Clear all 3 faults @enduml **Auto-Cluster Mode Flow:** .. uml:: @startuml skinparam backgroundColor #FEFEFE participant "Fault Source" as src participant "FaultManager" as fm participant "CorrelationEngine" as ce note over ce: min_count=3, window=500ms\nrepresentative=highest_severity == Faults Accumulate == src -> fm: ReportFault(SENSOR_001, ERROR) fm -> ce: process_fault("SENSOR_001", "ERROR") ce -> ce: Start pending cluster\n[SENSOR_001] ce --> fm: cluster_id="sensor_storm_1" note right: Not yet active\n(count=1 < min_count=3) src -> fm: ReportFault(SENSOR_002, WARN) fm -> ce: process_fault("SENSOR_002", "WARN") ce -> ce: Add to pending\n[SENSOR_001, SENSOR_002] ce --> fm: cluster_id="sensor_storm_1" == Cluster Activates == src -> fm: ReportFault(SENSOR_003, CRITICAL) fm -> ce: process_fault("SENSOR_003", "CRITICAL") ce -> ce: count=3 >= min_count\n**CLUSTER ACTIVE** ce -> ce: Select representative:\nSENSOR_003 (highest severity) ce --> fm: should_mute=false (is representative)\nretroactive_mute=[SENSOR_001, SENSOR_002] note right #LightGreen: SENSOR_003 shown\nothers muted src -> fm: ReportFault(SENSOR_004, ERROR) fm -> ce: process_fault("SENSOR_004", "ERROR") ce --> fm: should_mute=true (not representative) == Query Result == src -> fm: ListFaults(include_clusters=true) fm --> src: [SENSOR_003]\nclusters: [{id: "sensor_storm_1",\n fault_codes: [001,002,003,004],\n representative: "SENSOR_003"}] @enduml Quick Start ----------- 1. **Create a correlation configuration file:** .. code-block:: yaml # correlation.yaml correlation: enabled: true default_window_ms: 500 patterns: motor_errors: codes: ["MOTOR_*"] sensor_errors: codes: ["SENSOR_*"] rules: - id: estop_cascade name: "E-Stop Cascade" mode: hierarchical root_cause: codes: ["ESTOP_001"] symptoms: - pattern: motor_errors window_ms: 2000 mute_symptoms: true auto_clear_with_root: true 2. **Start the fault manager with correlation enabled:** .. code-block:: bash ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \ -p correlation.config_file:=/path/to/correlation.yaml 3. **Query faults with correlation data:** .. code-block:: bash curl "http://localhost:8080/api/v1/faults?include_muted=true" Configuration Reference ----------------------- Top-Level Settings ~~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 25 15 60 :header-rows: 1 * - Parameter - Default - Description * - ``enabled`` - ``false`` - Enable/disable fault correlation * - ``default_window_ms`` - ``500`` - Default time window for correlation rules (ms) Patterns Section ~~~~~~~~~~~~~~~~ Patterns define reusable groups of fault codes with wildcard support: .. code-block:: yaml patterns: motor_errors: codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*"] drive_faults: codes: ["DRIVE_*", "INVERTER_*"] sensor_errors: codes: ["SENSOR_*"] **Wildcard syntax:** - ``*`` matches any sequence of characters - ``MOTOR_*`` matches ``MOTOR_001``, ``MOTOR_OVERHEAT``, etc. - Exact codes (without ``*``) use fast string comparison Hierarchical Mode ----------------- Hierarchical mode identifies root cause → symptom relationships. .. code-block:: yaml rules: - id: estop_cascade name: "E-Stop Cascade" mode: hierarchical root_cause: codes: ["ESTOP_001", "ESTOP_002"] symptoms: - pattern: motor_errors - pattern: drive_faults window_ms: 2000 mute_symptoms: true auto_clear_with_root: true .. list-table:: :widths: 25 15 60 :header-rows: 1 * - Parameter - Default - Description * - ``root_cause.codes`` - (required) - Fault codes that trigger this rule * - ``symptoms`` - (required) - List of pattern references for symptom faults * - ``window_ms`` - ``default_window_ms`` - Time window to look for symptoms after root cause * - ``mute_symptoms`` - ``true`` - Hide symptom faults from normal queries * - ``auto_clear_with_root`` - ``true`` - Auto-clear symptoms when root cause is cleared **How it works:** 1. When ``ESTOP_001`` is reported, it's marked as a root cause 2. Any ``MOTOR_*`` or ``DRIVE_*`` faults within 2000ms are marked as symptoms 3. Symptoms are muted (not shown in default fault list) 4. Clearing ``ESTOP_001`` auto-clears all its symptoms Auto-Cluster Mode ----------------- Auto-cluster mode groups similar faults into clusters. .. code-block:: yaml rules: - id: sensor_storm name: "Sensor Storm" mode: auto_cluster match: - pattern: sensor_errors min_count: 3 window_ms: 2000 show_as_single: true representative: highest_severity .. list-table:: :widths: 25 15 60 :header-rows: 1 * - Parameter - Default - Description * - ``match`` - (required) - List of pattern references for faults to cluster * - ``min_count`` - ``3`` - Minimum faults needed to form a cluster * - ``window_ms`` - ``default_window_ms`` - Time window for clustering faults * - ``show_as_single`` - ``true`` - Show cluster as single fault (mute non-representatives) * - ``representative`` - ``highest_severity`` - How to select the cluster representative: ``first``, ``most_recent``, or ``highest_severity`` **How it works:** 1. First ``SENSOR_*`` fault starts a potential cluster 2. Additional ``SENSOR_*`` faults within 2000ms join the cluster 3. When 3+ faults accumulate, cluster becomes active 4. Only the representative fault is shown; others are muted Querying Correlation Data ------------------------- **Basic query (includes counts):** .. code-block:: bash curl http://localhost:8080/api/v1/faults Response always includes: .. code-block:: javascript { "faults": [...], // fault objects "count": 5, "muted_count": 12, "cluster_count": 2 } **With muted fault details:** .. code-block:: bash curl "http://localhost:8080/api/v1/faults?include_muted=true" .. code-block:: javascript { "faults": [...], // fault objects "muted_count": 12, "muted_faults": [ { "fault_code": "MOTOR_COMM_001", "root_cause_code": "ESTOP_001", "rule_id": "estop_cascade", "delay_ms": 150 } ] } **With cluster details:** .. code-block:: bash curl "http://localhost:8080/api/v1/faults?include_clusters=true" .. code-block:: javascript { "faults": [...], // fault objects "cluster_count": 2, "clusters": [ { "cluster_id": "sensor_storm_1", "rule_id": "sensor_storm", "rule_name": "Sensor Storm", "representative_code": "SENSOR_CRITICAL_001", "representative_severity": "CRITICAL", "fault_codes": ["SENSOR_001", "SENSOR_002", "SENSOR_CRITICAL_001"], "first_at": "2026-01-19T10:00:00Z", "last_at": "2026-01-19T10:00:01Z" } ] } **Auto-clear on root cause resolution:** .. code-block:: bash curl -X DELETE http://localhost:8080/api/v1/faults/ESTOP_001 .. code-block:: json { "success": true, "fault_code": "ESTOP_001", "auto_cleared_codes": ["MOTOR_COMM_001", "MOTOR_TIMEOUT_002", "DRIVE_001"] } Example: Complete Configuration ------------------------------- .. code-block:: yaml # /etc/ros2_medkit/correlation.yaml correlation: enabled: true default_window_ms: 500 patterns: motor_errors: codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*", "MOTOR_OVERHEAT_*"] drive_faults: codes: ["DRIVE_*", "INVERTER_*"] sensor_errors: codes: ["SENSOR_*"] battery_warnings: codes: ["BATTERY_LOW", "BATTERY_CRITICAL"] rules: # E-Stop causes motor and drive shutdowns - id: estop_cascade name: "E-Stop Cascade" mode: hierarchical root_cause: codes: ["ESTOP_001", "ESTOP_002"] symptoms: - pattern: motor_errors - pattern: drive_faults window_ms: 2000 mute_symptoms: true auto_clear_with_root: true # Battery critical causes low battery warnings - id: battery_cascade name: "Battery Cascade" mode: hierarchical root_cause: codes: ["BATTERY_CRITICAL"] symptoms: - pattern: battery_warnings window_ms: 1000 # Group sensor storms - id: sensor_storm name: "Sensor Storm" mode: auto_cluster match: - pattern: sensor_errors min_count: 3 window_ms: 2000 show_as_single: true representative: highest_severity Troubleshooting --------------- **Symptoms not being muted** - Check that ``mute_symptoms: true`` is set - Verify the symptom fault code matches a pattern in ``symptoms`` - Ensure the symptom occurs within ``window_ms`` of the root cause - Check fault manager logs for correlation matches **Cluster not forming** - Verify ``min_count`` faults have occurred within ``window_ms`` - Check that fault codes match patterns in ``match`` - Clusters only become "active" after reaching ``min_count`` **Root cause not detected** - Verify the fault code exactly matches one in ``root_cause.codes`` - Wildcards in ``root_cause.codes`` are supported **Configuration validation** The fault manager validates configuration on startup. Check logs for: .. code-block:: text [WARN] Rule 'my_rule' references unknown pattern: missing_pattern [ERROR] Hierarchical rule 'my_rule' has no root_cause codes See Also -------- - :doc:`snapshots` - Capture topic data when faults are confirmed - :doc:`../requirements/specs/faults` - Fault API requirements - `FaultManager README `_ - Detailed configuration reference