Fault Manager Configuration

The ros2_medkit_fault_manager node aggregates and manages faults from multiple sources. This page documents all configuration parameters.

Basic Configuration

Storage

fault_manager:
  ros__parameters:
    storage_type: "sqlite"              # Storage backend: "sqlite" or "memory"
    database_path: "/var/lib/ros2_medkit/faults.db"  # Path for sqlite storage

| Parameter | Default | Description |
|---|---|---|
| storage_type | sqlite | Storage backend. sqlite persists faults to disk; memory keeps them in RAM only. |
| database_path | /var/lib/ros2_medkit/faults.db | File path for the SQLite database. The directory must exist and be writable. |
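
Because the database directory must already exist and be writable, a deployment script may want to check it before starting the node. The following is a minimal sketch of such a pre-flight check; the helper name check_database_path is hypothetical and is not part of ros2_medkit.

```python
import os
import tempfile


def check_database_path(database_path: str) -> list[str]:
    """Return a list of problems with the directory that should hold the database."""
    problems = []
    parent = os.path.dirname(os.path.abspath(database_path))
    if not os.path.isdir(parent):
        problems.append(f"directory does not exist: {parent}")
    elif not os.access(parent, os.W_OK | os.X_OK):
        problems.append(f"directory is not writable: {parent}")
    return problems


# Usage: an existing temporary directory passes, a missing one is reported.
with tempfile.TemporaryDirectory() as tmp:
    ok = check_database_path(os.path.join(tmp, "faults.db"))
    missing = check_database_path(os.path.join(tmp, "no_such_dir", "faults.db"))
```

An empty list means the directory looks usable by the current user; the check says nothing about the account the node will actually run under.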

Debounce Settings

The fault manager uses AUTOSAR DEM-style debounce filtering to prevent fault flapping.

fault_manager:
  ros__parameters:
    confirmation_threshold: -1          # Counter threshold to confirm fault
    healing_enabled: false              # Enable auto-healing via PASSED events
    healing_threshold: 3                # Counter threshold to heal fault
    auto_confirm_after_sec: 0.0         # Auto-confirm timeout (0 = disabled)

| Parameter | Default | Description |
|---|---|---|
| confirmation_threshold | -1 | Counter threshold at which a fault is confirmed. More negative values require more FAILED events: -1 confirms on the first FAILED event, -3 requires 3 FAILED events. |
| healing_enabled | false | When true, PASSED events can heal confirmed faults. |
| healing_threshold | 3 | Number of PASSED events needed to transition from CONFIRMED to HEALED. |
| auto_confirm_after_sec | 0.0 | Auto-confirm prefailed faults after this duration (seconds). Set to 0 to disable. |

Tip

For immediate fault confirmation (no debounce), set confirmation_threshold: -1, which confirms on the first FAILED event. Faults with SEVERITY_CRITICAL always bypass debounce regardless of this setting.
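
The interaction between the thresholds can be illustrated with a small counter model. This is a simplified sketch, not the node's implementation: it assumes FAILED events count downwards and PASSED events count upwards, and that the counter restarts from zero when the direction changes (an option in AUTOSAR-style counter debouncing that this page does not confirm). The class name DebounceModel and the use of PREFAILED as the initial state are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class DebounceModel:
    """Simplified debounce counter; an illustration, not the fault manager's code."""
    confirmation_threshold: int = -1   # documented default
    healing_enabled: bool = False      # documented default
    healing_threshold: int = 3         # documented default
    counter: int = 0
    state: str = "PREFAILED"           # stands for any not-yet-confirmed fault

    def report(self, event: str) -> str:
        if event == "FAILED":
            # Assumption: a FAILED event restarts counting from 0 downwards.
            self.counter = min(self.counter, 0) - 1
            if self.counter <= self.confirmation_threshold:
                self.state = "CONFIRMED"
        elif event == "PASSED":
            # Assumption: a PASSED event restarts counting from 0 upwards.
            self.counter = max(self.counter, 0) + 1
            if (self.healing_enabled and self.state == "CONFIRMED"
                    and self.counter >= self.healing_threshold):
                self.state = "HEALED"
        else:
            raise ValueError(f"unknown event: {event}")
        return self.state


# Usage: three FAILED events confirm, three PASSED events then heal.
model = DebounceModel(confirmation_threshold=-3, healing_enabled=True)
states = [model.report(e) for e in ["FAILED"] * 3 + ["PASSED"] * 3]
```

Under these assumptions the default threshold of -1 confirms on the first FAILED event, and with healing_enabled left at false the model never leaves CONFIRMED.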

Per-Entity Thresholds

Different subsystems often have different failure characteristics. For example, a lidar sensor is binary (instant confirmation), while a motor controller may produce transient errors that need debouncing. Per-entity thresholds let you configure different debounce policies per reporting entity using longest-prefix matching on source_id.

fault_manager:
  ros__parameters:
    # Global defaults (used when no entity-specific match)
    confirmation_threshold: -1
    healing_enabled: false
    healing_threshold: 3

    # Path to YAML file with per-entity overrides
    entity_thresholds:
      config_file: "/etc/ros2_medkit/entity_thresholds.yaml"

The entity thresholds config file uses a simple map of entity path prefixes to threshold overrides:

# entity_thresholds.yaml
/sensors/lidar:
  confirmation_threshold: -1    # instant - lidar is binary
  healing_threshold: 1

/powertrain/motor_left:
  confirmation_threshold: -5    # motor has transients, need 5 events
  healing_threshold: 10

/safety:
  confirmation_threshold: -1    # instant, never auto-heal
  healing_enabled: false

| Parameter | Default | Description |
|---|---|---|
| entity_thresholds.config_file | "" | Path to YAML file with per-entity threshold overrides. Empty = disabled. |

How matching works:

  • The source_id is the identifier passed in ReportFault service requests, typically the fully qualified name of the reporting ROS 2 node (e.g., /sensors/lidar/front_node). You can inspect actual source_id values in the reporting_sources field of existing faults via GET /api/v1/faults.

  • The source_id from ReportFault requests is matched against configured prefixes.

  • The longest matching prefix wins. For example, /sensors/lidar/front matches /sensors/lidar over /sensors.

  • Unspecified fields in an entity override inherit from the global defaults.

  • If no prefix matches, the global defaults apply.

  • The config file is loaded once at node startup. Changes require a node restart.
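
The matching rules above can be sketched in a few lines. This is an illustration only: it assumes plain string-prefix comparison (the page does not say whether matches must end on a path-segment boundary), the function name resolve_thresholds is hypothetical, and the /sensors entry is added purely to show the longest-prefix rule.

```python
def resolve_thresholds(source_id: str, overrides: dict, defaults: dict) -> dict:
    """Return the effective thresholds for a source_id."""
    matches = [prefix for prefix in overrides if source_id.startswith(prefix)]
    effective = dict(defaults)          # unspecified fields inherit the globals
    if matches:
        effective.update(overrides[max(matches, key=len)])  # longest prefix wins
    return effective


DEFAULTS = {
    "confirmation_threshold": -1,
    "healing_enabled": False,
    "healing_threshold": 3,
}
OVERRIDES = {
    "/sensors": {"healing_threshold": 5},  # hypothetical broader prefix
    "/sensors/lidar": {"confirmation_threshold": -1, "healing_threshold": 1},
    "/powertrain/motor_left": {"confirmation_threshold": -5, "healing_threshold": 10},
}

lidar = resolve_thresholds("/sensors/lidar/front_node", OVERRIDES, DEFAULTS)
motor = resolve_thresholds("/powertrain/motor_left/driver", OVERRIDES, DEFAULTS)
other = resolve_thresholds("/navigation/planner", OVERRIDES, DEFAULTS)
```

Here lidar picks up healing_threshold 1 from /sensors/lidar rather than 5 from /sensors, motor keeps healing_enabled from the global defaults, and other falls back to the defaults entirely.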

Note

When multiple entities report the same fault_code, each event applies the thresholds resolved from that event’s source_id. This means the debounce behavior follows the reporting entity, not the fault.

auto_confirm_after_sec and critical_immediate_confirm are global-only and cannot be overridden per-entity.

Snapshot Configuration

Snapshots capture diagnostic data when faults occur.

Basic Snapshot Settings

fault_manager:
  ros__parameters:
    snapshots:
      enabled: true                     # Enable snapshot capture
      background_capture: false         # Capture in background thread
      timeout_sec: 1.0                  # Timeout for topic sampling
      max_message_size: 65536           # Max message size in bytes (64KB)
      default_topics: []                # Topics to capture for all faults
      config_file: ""                   # Path to YAML config file
      recapture_cooldown_sec: 60.0      # Min seconds between snapshot captures per fault
      max_per_fault: 10                 # Max snapshots stored per fault code (0 = unlimited)
      capture_pool_size: 2              # Max concurrent capture threads (>= 1)
      capture_queue_depth: 16           # Max pending captures before policy applies (>= 1)
      capture_queue_full_policy: reject_newest  # reject_newest | drop_oldest

| Parameter | Default | Description |
|---|---|---|
| snapshots.enabled | true | Master switch to enable/disable snapshot capture. |
| snapshots.background_capture | false | Capture snapshots in background thread (non-blocking). |
| snapshots.timeout_sec | 1.0 | Timeout for sampling each topic. |
| snapshots.max_message_size | 65536 | Maximum message size to capture (bytes). Larger messages are truncated. |
| snapshots.default_topics | [] | List of topics to capture for all faults. |
| snapshots.config_file | "" | Path to YAML file with fault-specific snapshot configurations. |
| snapshots.recapture_cooldown_sec | 60.0 | Minimum seconds between snapshot captures for the same fault code. Prevents snapshot storms when a fault is reported repeatedly. Set to 0 to disable. |
| snapshots.max_per_fault | 10 | Maximum number of snapshots stored per fault code. When the limit is reached, new snapshots for that fault are rejected. Set to 0 for unlimited. |
| snapshots.capture_pool_size | 2 | Max concurrent capture threads under a fault storm (>= 1). The capture pool is shared and created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only; rosbag is single-writer and records one fault at a time regardless. |
| snapshots.capture_queue_depth | 16 | Max pending captures before the full-queue policy applies (>= 1). |
| snapshots.capture_queue_full_policy | reject_newest | Policy when the queue is full: reject_newest or drop_oldest. |
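
The difference between the two full-queue policies can be shown with a small bounded queue. This sketch assumes the usual meaning of the policy names (reject_newest refuses the incoming request, drop_oldest evicts the oldest pending one); the CaptureQueue class is hypothetical and does not reflect the node's internal data structures.

```python
from collections import deque


class CaptureQueue:
    """Bounded queue of pending capture requests (illustrative only)."""

    def __init__(self, depth: int = 16, policy: str = "reject_newest"):
        if depth < 1:
            raise ValueError("capture_queue_depth must be >= 1")
        if policy not in ("reject_newest", "drop_oldest"):
            raise ValueError(f"unknown policy: {policy}")
        self.depth = depth
        self.policy = policy
        self.pending = deque()

    def offer(self, fault_code: str):
        """Enqueue a request; return the request that was discarded, if any."""
        if len(self.pending) < self.depth:
            self.pending.append(fault_code)
            return None
        if self.policy == "reject_newest":
            return fault_code                 # incoming request is refused
        dropped = self.pending.popleft()      # oldest pending request is evicted
        self.pending.append(fault_code)
        return dropped


# Usage: the same three requests against a queue of depth 2.
rejecting = CaptureQueue(depth=2, policy="reject_newest")
rejected = [rejecting.offer(code) for code in ["A", "B", "C"]]

dropping = CaptureQueue(depth=2, policy="drop_oldest")
dropped = [dropping.offer(code) for code in ["A", "B", "C"]]
```

With reject_newest the queue still holds A and B and request C is lost; with drop_oldest it holds B and C and request A is lost. Which loss is preferable depends on whether the earliest or the most recent fault context matters more during a storm.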

Rosbag Recording

Capture continuous rosbag recordings around fault events.

fault_manager:
  ros__parameters:
    snapshots:
      rosbag:
        enabled: false                  # Enable rosbag recording
        duration_sec: 5.0               # Pre-fault buffer duration
        duration_after_sec: 1.0         # Post-fault recording duration
        topics: "entity"                # Topic selection: "entity" (default), "config", "all", "explicit"
        include_topics: []              # Additional topics to include
        exclude_topics: []              # Topics to exclude
        exclude_sensor_topics: true     # Auto-exclude image/points/depth/compressed in broad modes
        lazy_start: false               # Start recording on first fault
        format: "sqlite3"               # Storage format
        qos_match: true                 # Match each topic's publisher QoS
        storage_path: ""                # Custom storage path
        max_buffer_mb: 256              # Ring-buffer RAM cap
        max_bag_size_mb: 50             # Max size per bag file
        max_total_storage_mb: 500       # Max total storage
        auto_cleanup: true              # Auto-delete old bags

| Parameter | Default | Description |
|---|---|---|
| rosbag.enabled | false | Enable rosbag recording for snapshots. |
| rosbag.duration_sec | 5.0 | Duration of pre-fault circular buffer. |
| rosbag.duration_after_sec | 1.0 | How long to record after fault. |
| rosbag.topics | entity | Topic selection mode: entity (default; write only the faulting node’s topics + /tf), config (per-fault), all, or explicit. |
| rosbag.exclude_sensor_topics | true | In broad modes (all/entity), auto-exclude high-bandwidth sensor topics (image/points/depth/compressed) to bound memory. Excluded topics are dropped silently; include_topics re-adds any you need. |
| rosbag.qos_match | true | Subscribe with each topic’s publisher-offered QoS for faithful capture instead of forcing best-effort. |
| rosbag.max_buffer_mb | 256 | Ring-buffer RAM cap; oldest buffered messages drop past it. |
| rosbag.lazy_start | false | Start recording only when first fault occurs. |
| rosbag.max_bag_size_mb | 50 | Maximum size per rosbag file (MB). |
| rosbag.max_total_storage_mb | 500 | Maximum total storage for all rosbags (MB). |
| rosbag.auto_cleanup | true | Automatically delete oldest rosbags when storage limit reached. |
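
To reason about how max_total_storage_mb and auto_cleanup interact, the following sketch selects which bags an oldest-first cleanup would remove. It is an assumption-laden model: the function name bags_to_delete and the (name, created_at, size_mb) tuples are hypothetical, and the node's real cleanup may differ in when and how it measures storage.

```python
def bags_to_delete(bags, max_total_storage_mb):
    """Return bag names to delete, oldest first, until the total fits the limit.

    bags: list of (name, created_at, size_mb) tuples.
    """
    total = sum(size for _, _, size in bags)
    doomed = []
    for name, _, size in sorted(bags, key=lambda bag: bag[1]):
        if total <= max_total_storage_mb:
            break
        doomed.append(name)
        total -= size
    return doomed


# Usage: three 200 MB bags against the default 500 MB limit.
bags = [("fault_c", 3, 200), ("fault_a", 1, 200), ("fault_b", 2, 200)]
to_delete = bags_to_delete(bags, max_total_storage_mb=500)
```

In this example only the oldest bag is removed, which brings the total from 600 MB down to 400 MB.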

See also

Configuring Snapshot Capture for detailed snapshot configuration examples.

Correlation Configuration

Fault correlation identifies root causes and filters symptom faults.

fault_manager:
  ros__parameters:
    correlation:
      config_file: "/path/to/correlation_rules.yaml"
      cleanup_interval_sec: 5.0         # Interval for cleanup tasks

| Parameter | Default | Description |
|---|---|---|
| correlation.config_file | "" | Path to YAML file defining correlation rules. |
| correlation.cleanup_interval_sec | 5.0 | Interval for running correlation cleanup tasks. |

See also

Configuring Fault Correlation for correlation rule syntax and examples.

Complete Example

fault_manager:
  ros__parameters:
    # Storage
    storage_type: "sqlite"
    database_path: "/var/lib/ros2_medkit/faults.db"

    # Debounce (require 3 FAILED events to confirm)
    confirmation_threshold: -3
    healing_enabled: true
    healing_threshold: 3
    auto_confirm_after_sec: 30.0

    # Per-entity debounce overrides
    entity_thresholds:
      config_file: "/etc/ros2_medkit/entity_thresholds.yaml"

    # Snapshots
    snapshots:
      enabled: true
      background_capture: true
      timeout_sec: 2.0
      max_message_size: 131072
      recapture_cooldown_sec: 60.0
      max_per_fault: 10
      default_topics:
        - /diagnostics
        - /rosout
      config_file: "/etc/ros2_medkit/snapshot_config.yaml"
      rosbag:
        enabled: true
        duration_sec: 10.0
        duration_after_sec: 2.0
        topics: "config"
        max_bag_size_mb: 100
        max_total_storage_mb: 1000
        auto_cleanup: true

    # Correlation
    correlation:
      config_file: "/etc/ros2_medkit/correlation_rules.yaml"
      cleanup_interval_sec: 10.0
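
Before deploying a parameter file like the one above, it can help to check the values against the constraints stated on this page. The sketch below works on an already-parsed dictionary and only covers constraints documented here (allowed storage_type, queue policy and rosbag topic modes, and the >= 1 bounds). The validate function is hypothetical; the node performs its own validation, which may be stricter.

```python
def validate(params: dict) -> list[str]:
    """Return human-readable errors for values that violate documented constraints."""
    errors = []
    if params.get("storage_type", "sqlite") not in ("sqlite", "memory"):
        errors.append("storage_type must be 'sqlite' or 'memory'")

    snapshots = params.get("snapshots", {})
    if snapshots.get("capture_pool_size", 2) < 1:
        errors.append("snapshots.capture_pool_size must be >= 1")
    if snapshots.get("capture_queue_depth", 16) < 1:
        errors.append("snapshots.capture_queue_depth must be >= 1")
    policy = snapshots.get("capture_queue_full_policy", "reject_newest")
    if policy not in ("reject_newest", "drop_oldest"):
        errors.append("snapshots.capture_queue_full_policy is not a known policy")

    rosbag = snapshots.get("rosbag", {})
    if rosbag.get("topics", "entity") not in ("entity", "config", "all", "explicit"):
        errors.append("snapshots.rosbag.topics is not a known mode")
    return errors


# Usage: a subset of the complete example, followed by a deliberately broken config.
example = {
    "storage_type": "sqlite",
    "confirmation_threshold": -3,
    "snapshots": {
        "enabled": True,
        "max_per_fault": 10,
        "rosbag": {"enabled": True, "topics": "config"},
    },
}
good = validate(example)
bad = validate({"storage_type": "redis", "snapshots": {"capture_pool_size": 0}})
```

Missing keys fall back to the defaults listed in the tables, so a partial file is checked the same way as a complete one.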

See Also