Configuring Fault Correlation

This tutorial shows how to configure fault correlation to automatically identify root causes and reduce fault noise in complex systems.

Overview

When multiple faults occur in rapid succession, they often share a common root cause. Fault correlation handles this in two modes:

  • Hierarchical mode: identifies root cause → symptom relationships. Symptoms are muted while the root cause is displayed.

  • Auto-cluster mode: groups similar faults that occur within a time window into a single cluster.

This is similar to:

  • AUTOSAR DEM event combination - grouping related diagnostic events

  • AIOps alert correlation - reducing alert storms in monitoring systems

Benefits:

  • Reduced fault noise for operators

  • Faster root cause identification

  • Cleaner fault lists in dashboards

How It Works

Hierarchical Mode Flow:

@startuml
skinparam backgroundColor #FEFEFE
skinparam sequenceMessageAlign center

participant "Fault Source" as src
participant "FaultManager" as fm
participant "CorrelationEngine" as ce
database "Fault Storage" as db

== Root Cause Detected ==
src -> fm: ReportFault(ESTOP_001, CRITICAL)
fm -> ce: process_fault("ESTOP_001")
ce -> ce: Register as root cause\n(rule: estop_cascade, window: 2000ms)
ce --> fm: is_root_cause=true
fm -> db: Store fault
fm --> src: accepted

== Symptoms Arrive (within window) ==
src -> fm: ReportFault(MOTOR_COMM_FL, ERROR)
fm -> ce: process_fault("MOTOR_COMM_FL")
ce -> ce: Matches MOTOR_* pattern\nWithin 2000ms window
ce --> fm: should_mute=true, root_cause="ESTOP_001"
fm -> db: Store fault (muted)
note right: Fault stored but\nnot published to SSE

src -> fm: ReportFault(DRIVE_FAULT, ERROR)
fm -> ce: process_fault("DRIVE_FAULT")
ce --> fm: should_mute=true
fm -> db: Store fault (muted)

== Query Faults ==
src -> fm: ListFaults()
fm --> src: [ESTOP_001], muted_count=2

== Clear Root Cause ==
src -> fm: ClearFault(ESTOP_001)
fm -> ce: process_clear("ESTOP_001")
ce --> fm: auto_cleared=[MOTOR_COMM_FL, DRIVE_FAULT]
fm -> db: Clear all 3 faults
@enduml

Auto-Cluster Mode Flow:

@startuml
skinparam backgroundColor #FEFEFE

participant "Fault Source" as src
participant "FaultManager" as fm
participant "CorrelationEngine" as ce

note over ce: min_count=3, window=500ms\nrepresentative=highest_severity

== Faults Accumulate ==
src -> fm: ReportFault(SENSOR_001, ERROR)
fm -> ce: process_fault("SENSOR_001", "ERROR")
ce -> ce: Start pending cluster\n[SENSOR_001]
ce --> fm: cluster_id="sensor_storm_1"
note right: Not yet active\n(count=1 < min_count=3)

src -> fm: ReportFault(SENSOR_002, WARN)
fm -> ce: process_fault("SENSOR_002", "WARN")
ce -> ce: Add to pending\n[SENSOR_001, SENSOR_002]
ce --> fm: cluster_id="sensor_storm_1"

== Cluster Activates ==
src -> fm: ReportFault(SENSOR_003, CRITICAL)
fm -> ce: process_fault("SENSOR_003", "CRITICAL")
ce -> ce: count=3 >= min_count\n**CLUSTER ACTIVE**
ce -> ce: Select representative:\nSENSOR_003 (highest severity)
ce --> fm: should_mute=false (is representative)\nretroactive_mute=[SENSOR_001, SENSOR_002]
note right #LightGreen: SENSOR_003 shown\nothers muted

src -> fm: ReportFault(SENSOR_004, ERROR)
fm -> ce: process_fault("SENSOR_004", "ERROR")
ce --> fm: should_mute=true (not representative)

== Query Result ==
src -> fm: ListFaults(include_clusters=true)
fm --> src: [SENSOR_003]\nclusters: [{id: "sensor_storm_1",\n  fault_codes: [001,002,003,004],\n  representative: "SENSOR_003"}]
@enduml

Quick Start

  1. Create a correlation configuration file:

    # correlation.yaml
    correlation:
      enabled: true
      default_window_ms: 500
    
      patterns:
        motor_errors:
          codes: ["MOTOR_*"]
        sensor_errors:
          codes: ["SENSOR_*"]
    
      rules:
        - id: estop_cascade
          name: "E-Stop Cascade"
          mode: hierarchical
          root_cause:
            codes: ["ESTOP_001"]
          symptoms:
            - pattern: motor_errors
          window_ms: 2000
          mute_symptoms: true
          auto_clear_with_root: true
    
  2. Start the fault manager with correlation enabled:

    ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
      -p correlation.config_file:=/path/to/correlation.yaml
    
  3. Query faults with correlation data:

    curl "http://localhost:8080/api/v1/faults?include_muted=true"
    

Configuration Reference

Top-Level Settings

Parameter            Default   Description
-----------------    -------   ----------------------------------------------
enabled              false     Enable/disable fault correlation
default_window_ms    500       Default time window for correlation rules (ms)

Patterns Section

Patterns define reusable groups of fault codes with wildcard support:

patterns:
  motor_errors:
    codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*"]
  drive_faults:
    codes: ["DRIVE_*", "INVERTER_*"]
  sensor_errors:
    codes: ["SENSOR_*"]

Wildcard syntax:

  • * matches any sequence of characters

  • MOTOR_* matches MOTOR_001, MOTOR_OVERHEAT, etc.

  • Exact codes (without *) use fast string comparison
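
The wildcard rules above can be illustrated with a small matcher. This is a minimal Python sketch, not the fault manager's implementation: the names compile_code_pattern and matches_any are hypothetical, and it assumes that * may also match an empty sequence, which the text does not state.

```python
import re

def compile_code_pattern(pattern: str):
    """Return a predicate that tests a fault code against one pattern."""
    if "*" not in pattern:
        # Exact codes need no regex: plain string equality is enough.
        return lambda code: code == pattern
    # Escape every literal part, then let each * match any run of characters
    # (including an empty run, which is an assumption of this sketch).
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    return lambda code: regex.fullmatch(code) is not None

def matches_any(code: str, patterns: list) -> bool:
    """True if the code matches at least one pattern in the list."""
    return any(compile_code_pattern(p)(code) for p in patterns)

motor_errors = ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*"]
print(matches_any("MOTOR_COMM_001", motor_errors))   # True
print(matches_any("SENSOR_001", motor_errors))       # False
```

Only * is treated as special here; every other character, including ? and [, is matched literally.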

Hierarchical Mode

Hierarchical mode identifies root cause → symptom relationships.

rules:
  - id: estop_cascade
    name: "E-Stop Cascade"
    mode: hierarchical
    root_cause:
      codes: ["ESTOP_001", "ESTOP_002"]
    symptoms:
      - pattern: motor_errors
      - pattern: drive_faults
    window_ms: 2000
    mute_symptoms: true
    auto_clear_with_root: true

Parameter              Default             Description
--------------------   -----------------   ------------------------------------------------------
root_cause.codes       (required)          Fault codes that trigger this rule
symptoms               (required)          List of pattern references for symptom faults
window_ms              default_window_ms   Time window to look for symptoms after root cause (ms)
mute_symptoms          true                Hide symptom faults from normal queries
auto_clear_with_root   true                Auto-clear symptoms when root cause is cleared

How it works:

  1. When ESTOP_001 or ESTOP_002 is reported, it’s marked as a root cause

  2. Any fault matching the motor_errors or drive_faults patterns within 2000 ms of the root cause is marked as a symptom

  3. Symptoms are muted (not shown in default fault list)

  4. Clearing the root cause auto-clears all its symptoms
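
The four steps can be sketched in a few lines of Python. This is an illustrative model only: the class HierarchicalRule, its fields, and the use of simple prefixes in place of pattern references are assumptions made for the example, and the window is treated as inclusive of its end point.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalRule:
    root_codes: set
    symptom_prefixes: tuple          # simplified stand-in for pattern references
    window_ms: int = 2000
    # root code -> time it was reported (ms)
    active_roots: dict = field(default_factory=dict)
    # root code -> symptom codes muted under it
    symptoms: dict = field(default_factory=dict)

    def process_fault(self, code: str, now_ms: int) -> dict:
        if code in self.root_codes:
            # Step 1: remember the root cause and when it arrived.
            self.active_roots[code] = now_ms
            self.symptoms.setdefault(code, [])
            return {"is_root_cause": True, "should_mute": False}
        if code.startswith(self.symptom_prefixes):
            for root, reported_at in self.active_roots.items():
                # Steps 2 and 3: a symptom inside the window is muted.
                if 0 <= now_ms - reported_at <= self.window_ms:
                    self.symptoms[root].append(code)
                    return {"is_root_cause": False, "should_mute": True,
                            "root_cause": root}
        return {"is_root_cause": False, "should_mute": False}

    def process_clear(self, code: str) -> list:
        # Step 4: clearing a root cause returns the symptoms to auto-clear.
        self.active_roots.pop(code, None)
        return self.symptoms.pop(code, [])

rule = HierarchicalRule({"ESTOP_001"}, ("MOTOR_", "DRIVE_"))
rule.process_fault("ESTOP_001", now_ms=0)
print(rule.process_fault("MOTOR_COMM_FL", now_ms=150)["should_mute"])  # True
print(rule.process_fault("DRIVE_FAULT", now_ms=2500)["should_mute"])   # False
print(rule.process_clear("ESTOP_001"))                                 # ['MOTOR_COMM_FL']
```

In the example, DRIVE_FAULT arrives 2500 ms after the root cause, outside the 2000 ms window, so it is neither muted nor auto-cleared.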

Auto-Cluster Mode

Auto-cluster mode groups similar faults into clusters.

rules:
  - id: sensor_storm
    name: "Sensor Storm"
    mode: auto_cluster
    match:
      - pattern: sensor_errors
    min_count: 3
    window_ms: 2000
    show_as_single: true
    representative: highest_severity

Parameter        Default             Description
--------------   -----------------   --------------------------------------------------------
match            (required)          List of pattern references for faults to cluster
min_count        3                   Minimum faults needed to form a cluster
window_ms        default_window_ms   Time window for clustering faults (ms)
show_as_single   true                Show cluster as single fault (mute non-representatives)
representative   highest_severity    How to select the cluster representative: first,
                                     most_recent, or highest_severity

How it works:

  1. First SENSOR_* fault starts a potential cluster

  2. Additional SENSOR_* faults within 2000ms join the cluster

  3. When 3+ faults accumulate, cluster becomes active

  4. Only the representative fault is shown; others are muted
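
The clustering steps, including representative selection, can be modelled as below. This is a hedged sketch rather than the engine's actual logic: the AutoCluster class and SEVERITY_RANK ordering are hypothetical, and it assumes the window is measured from the first fault of the pending cluster and that ties in severity go to the earliest fault.

```python
# Assumed severity ordering; the real engine may rank levels differently.
SEVERITY_RANK = {"INFO": 0, "WARN": 1, "ERROR": 2, "CRITICAL": 3}

class AutoCluster:
    def __init__(self, min_count=3, window_ms=2000, representative="highest_severity"):
        self.min_count = min_count
        self.window_ms = window_ms
        self.representative = representative
        self.members = []  # (code, severity, time_ms) in arrival order

    def _pick_representative(self) -> str:
        if self.representative == "first":
            return self.members[0][0]
        if self.representative == "most_recent":
            return self.members[-1][0]
        # highest_severity: max() keeps the earliest member on ties
        return max(self.members, key=lambda m: SEVERITY_RANK[m[1]])[0]

    def process_fault(self, code: str, severity: str, now_ms: int) -> dict:
        # A fault arriving after the window restarts the pending cluster.
        if self.members and now_ms - self.members[0][2] > self.window_ms:
            self.members = []
        self.members.append((code, severity, now_ms))
        if len(self.members) < self.min_count:
            return {"active": False, "representative": None, "muted": []}
        rep = self._pick_representative()
        muted = [c for c, _, _ in self.members if c != rep]
        return {"active": True, "representative": rep, "muted": muted}

cluster = AutoCluster(min_count=3, window_ms=2000)
cluster.process_fault("SENSOR_001", "ERROR", 0)
cluster.process_fault("SENSOR_002", "WARN", 100)
print(cluster.process_fault("SENSOR_003", "CRITICAL", 200))
# {'active': True, 'representative': 'SENSOR_003', 'muted': ['SENSOR_001', 'SENSOR_002']}
```

The third fault activates the cluster and the two earlier faults are muted retroactively, which mirrors the auto-cluster flow described under How It Works.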

Querying Correlation Data

Basic query (includes counts):

curl http://localhost:8080/api/v1/faults

Response always includes:

{
  "faults": [...],  // fault objects
  "count": 5,
  "muted_count": 12,
  "cluster_count": 2
}

With muted fault details:

curl "http://localhost:8080/api/v1/faults?include_muted=true"

Response:

{
  "faults": [...],  // fault objects
  "muted_count": 12,
  "muted_faults": [
    {
      "fault_code": "MOTOR_COMM_001",
      "root_cause_code": "ESTOP_001",
      "rule_id": "estop_cascade",
      "delay_ms": 150
    }
  ]
}
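
A client can use the muted_faults list to see which symptoms were suppressed under each root cause. The following Python sketch works on an already-decoded response; the function name group_muted_by_root is hypothetical, and only the field names shown in the response above are used.

```python
from collections import defaultdict

def group_muted_by_root(response: dict) -> dict:
    """Map each root cause code to its muted (fault_code, delay_ms) pairs."""
    grouped = defaultdict(list)
    for muted in response.get("muted_faults", []):
        grouped[muted["root_cause_code"]].append(
            (muted["fault_code"], muted["delay_ms"])
        )
    return dict(grouped)

# Sample payload shaped like the include_muted=true response.
sample = {
    "faults": [],
    "muted_count": 1,
    "muted_faults": [
        {
            "fault_code": "MOTOR_COMM_001",
            "root_cause_code": "ESTOP_001",
            "rule_id": "estop_cascade",
            "delay_ms": 150,
        }
    ],
}
print(group_muted_by_root(sample))  # {'ESTOP_001': [('MOTOR_COMM_001', 150)]}
```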

With cluster details:

curl "http://localhost:8080/api/v1/faults?include_clusters=true"

Response:

{
  "faults": [...],  // fault objects
  "cluster_count": 2,
  "clusters": [
    {
      "cluster_id": "sensor_storm_1",
      "rule_id": "sensor_storm",
      "rule_name": "Sensor Storm",
      "representative_code": "SENSOR_CRITICAL_001",
      "representative_severity": "CRITICAL",
      "fault_codes": ["SENSOR_001", "SENSOR_002", "SENSOR_CRITICAL_001"],
      "first_at": "2026-01-19T10:00:00Z",
      "last_at": "2026-01-19T10:00:01Z"
    }
  ]
}
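
When tuning window_ms, it can help to check how long a reported cluster actually lasted. The sketch below computes that span from first_at and last_at; the helper name cluster_span_ms is hypothetical, and it assumes the timestamps are ISO 8601 strings in UTC as in the response above.

```python
from datetime import datetime

def cluster_span_ms(cluster: dict) -> int:
    """Milliseconds between the first and last fault of a cluster."""
    def parse(ts: str) -> datetime:
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    delta = parse(cluster["last_at"]) - parse(cluster["first_at"])
    return round(delta.total_seconds() * 1000)

cluster = {
    "cluster_id": "sensor_storm_1",
    "first_at": "2026-01-19T10:00:00Z",
    "last_at": "2026-01-19T10:00:01Z",
}
print(cluster_span_ms(cluster))  # 1000
```

A span that is regularly close to the rule's window_ms may indicate that related faults are arriving just outside the window.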

Auto-clear on root cause resolution:

curl -X DELETE http://localhost:8080/api/v1/faults/ESTOP_001

Response:

{
  "success": true,
  "fault_code": "ESTOP_001",
  "auto_cleared_codes": ["MOTOR_COMM_001", "MOTOR_TIMEOUT_002", "DRIVE_001"]
}

Example: Complete Configuration

# /etc/ros2_medkit/correlation.yaml
correlation:
  enabled: true
  default_window_ms: 500

  patterns:
    motor_errors:
      codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*", "MOTOR_OVERHEAT_*"]
    drive_faults:
      codes: ["DRIVE_*", "INVERTER_*"]
    sensor_errors:
      codes: ["SENSOR_*"]
    battery_warnings:
      codes: ["BATTERY_LOW", "BATTERY_CRITICAL"]

  rules:
    # E-Stop causes motor and drive shutdowns
    - id: estop_cascade
      name: "E-Stop Cascade"
      mode: hierarchical
      root_cause:
        codes: ["ESTOP_001", "ESTOP_002"]
      symptoms:
        - pattern: motor_errors
        - pattern: drive_faults
      window_ms: 2000
      mute_symptoms: true
      auto_clear_with_root: true

    # Battery critical causes low battery warnings
    - id: battery_cascade
      name: "Battery Cascade"
      mode: hierarchical
      root_cause:
        codes: ["BATTERY_CRITICAL"]
      symptoms:
        - pattern: battery_warnings
      window_ms: 1000

    # Group sensor storms
    - id: sensor_storm
      name: "Sensor Storm"
      mode: auto_cluster
      match:
        - pattern: sensor_errors
      min_count: 3
      window_ms: 2000
      show_as_single: true
      representative: highest_severity

Troubleshooting

Symptoms not being muted

  • Check that mute_symptoms: true is set

  • Verify the symptom fault code matches a pattern in symptoms

  • Ensure the symptom occurs within window_ms of the root cause

  • Check fault manager logs for correlation matches

Cluster not forming

  • Verify min_count faults have occurred within window_ms

  • Check that fault codes match patterns in match

  • Clusters only become “active” after reaching min_count

Root cause not detected

  • Verify the fault code matches an entry in root_cause.codes

  • Wildcards in root_cause.codes are supported (for example, ESTOP_* matches ESTOP_001)

Configuration validation

The fault manager validates configuration on startup. Check logs for:

[WARN] Rule 'my_rule' references unknown pattern: missing_pattern
[ERROR] Hierarchical rule 'my_rule' has no root_cause codes
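
The same two checks can be run before deployment on a configuration that has already been loaded into a dictionary. This Python sketch is not the fault manager's validator: the function validate_correlation is hypothetical, it expects the contents of the correlation key, and it only reproduces the two conditions behind the log lines shown above.

```python
def validate_correlation(config: dict) -> list:
    """Return log-style messages for the two problems described above."""
    messages = []
    known = set(config.get("patterns", {}))
    for rule in config.get("rules", []):
        rule_id = rule.get("id", "<unnamed>")
        # Hierarchical rules reference patterns under symptoms,
        # auto_cluster rules under match.
        refs = rule.get("symptoms", []) + rule.get("match", [])
        for ref in refs:
            name = ref.get("pattern")
            if name not in known:
                messages.append(
                    f"[WARN] Rule '{rule_id}' references unknown pattern: {name}"
                )
        has_codes = bool(rule.get("root_cause", {}).get("codes"))
        if rule.get("mode") == "hierarchical" and not has_codes:
            messages.append(
                f"[ERROR] Hierarchical rule '{rule_id}' has no root_cause codes"
            )
    return messages

broken = {
    "patterns": {"motor_errors": {"codes": ["MOTOR_*"]}},
    "rules": [
        {"id": "my_rule", "mode": "hierarchical",
         "symptoms": [{"pattern": "missing_pattern"}]},
    ],
}
for line in validate_correlation(broken):
    print(line)
```

For the sample above, the function reports both the unknown pattern reference and the missing root_cause codes.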

See Also