Fault Manager Configuration
The ros2_medkit_fault_manager node aggregates and manages faults from multiple sources.
This page documents all configuration parameters.
Basic Configuration
Storage
fault_manager:
ros__parameters:
storage_type: "sqlite" # Storage backend: "sqlite" or "memory"
database_path: "/var/lib/ros2_medkit/faults.db" # Path for sqlite storage
Parameter |
Default |
Description |
|---|---|---|
|
|
Storage backend. |
|
|
File path for SQLite database. Directory must exist and be writable. |
Debounce Settings
The fault manager uses AUTOSAR DEM-style debounce filtering to prevent fault flapping.
fault_manager:
ros__parameters:
confirmation_threshold: -1 # Counter threshold to confirm fault
healing_enabled: false # Enable auto-healing via PASSED events
healing_threshold: 3 # Counter threshold to heal fault
auto_confirm_after_sec: 0.0 # Auto-confirm timeout (0 = disabled)
Parameter |
Default |
Description |
|---|---|---|
|
|
Number of FAILED events to confirm fault. Negative values mean more events needed.
Use |
|
|
When true, PASSED events can heal confirmed faults. |
|
|
Number of PASSED events to transition from CONFIRMED to HEALED. |
|
|
Auto-confirm prefailed faults after this duration. Set to 0 to disable. |
Tip
For immediate fault confirmation (no debounce), set confirmation_threshold: 0.
Faults with SEVERITY_CRITICAL always bypass debounce regardless of this setting.
Per-Entity Thresholds
Different subsystems often have different failure characteristics. For example, a lidar
sensor is binary (instant confirmation), while a motor controller may produce transient
errors that need debouncing. Per-entity thresholds let you configure different debounce
policies per reporting entity using longest-prefix matching on source_id.
fault_manager:
ros__parameters:
# Global defaults (used when no entity-specific match)
confirmation_threshold: -1
healing_enabled: false
healing_threshold: 3
# Path to YAML file with per-entity overrides
entity_thresholds:
config_file: "/etc/ros2_medkit/entity_thresholds.yaml"
The entity thresholds config file uses a simple map of entity path prefixes to threshold overrides:
# entity_thresholds.yaml
/sensors/lidar:
confirmation_threshold: -1 # instant - lidar is binary
healing_threshold: 1
/powertrain/motor_left:
confirmation_threshold: -5 # motor has transients, need 5 events
healing_threshold: 10
/safety:
confirmation_threshold: -1 # instant, never auto-heal
healing_enabled: false
Parameter |
Default |
Description |
|---|---|---|
|
|
Path to YAML file with per-entity threshold overrides. Empty = disabled. |
How matching works:
The
source_idis the identifier passed inReportFaultservice requests, typically the fully qualified name of the reporting ROS 2 node (e.g.,/sensors/lidar/front_node). You can inspect actualsource_idvalues in thereporting_sourcesfield of existing faults viaGET /api/v1/faults.The
source_idfromReportFaultrequests is matched against configured prefixes.The longest matching prefix wins. For example,
/sensors/lidar/frontmatches/sensors/lidarover/sensors.Unspecified fields in an entity override inherit from the global defaults.
If no prefix matches, the global defaults apply.
The config file is loaded once at node startup. Changes require a node restart.
Note
When multiple entities report the same fault_code, each event applies the
thresholds resolved from that event’s source_id. This means the debounce
behavior follows the reporting entity, not the fault.
auto_confirm_after_sec and critical_immediate_confirm are global-only and
cannot be overridden per-entity.
Snapshot Configuration
Snapshots capture diagnostic data when faults occur.
Basic Snapshot Settings
fault_manager:
ros__parameters:
snapshots:
enabled: true # Enable snapshot capture
background_capture: false # Capture in background thread
timeout_sec: 1.0 # Timeout for topic sampling
max_message_size: 65536 # Max message size in bytes (64KB)
default_topics: [] # Topics to capture for all faults
config_file: "" # Path to YAML config file
Parameter |
Default |
Description |
|---|---|---|
|
|
Master switch to enable/disable snapshot capture. |
|
|
Capture snapshots in background thread (non-blocking). |
|
|
Timeout for sampling each topic. |
|
|
Maximum message size to capture (bytes). Larger messages are truncated. |
|
|
List of topics to capture for all faults. |
|
|
Path to YAML file with fault-specific snapshot configurations. |
Rosbag Recording
Capture continuous rosbag recordings around fault events.
fault_manager:
ros__parameters:
snapshots:
rosbag:
enabled: false # Enable rosbag recording
duration_sec: 5.0 # Pre-fault buffer duration
duration_after_sec: 1.0 # Post-fault recording duration
topics: "config" # Topic selection: "config", "all", or "none"
include_topics: [] # Additional topics to include
exclude_topics: [] # Topics to exclude
lazy_start: false # Start recording on first fault
format: "sqlite3" # Storage format
storage_path: "" # Custom storage path
max_bag_size_mb: 50 # Max size per bag file
max_total_storage_mb: 500 # Max total storage
auto_cleanup: true # Auto-delete old bags
Parameter |
Default |
Description |
|---|---|---|
|
|
Enable rosbag recording for snapshots. |
|
|
Duration of pre-fault circular buffer. |
|
|
How long to record after fault. |
|
|
Topic selection mode: |
|
|
Start recording only when first fault occurs. |
|
|
Maximum size per rosbag file (MB). |
|
|
Maximum total storage for all rosbags (MB). |
|
|
Automatically delete oldest rosbags when storage limit reached. |
See also
Configuring Snapshot Capture for detailed snapshot configuration examples.
Correlation Configuration
Fault correlation identifies root causes and filters symptom faults.
fault_manager:
ros__parameters:
correlation:
config_file: "/path/to/correlation_rules.yaml"
cleanup_interval_sec: 5.0 # Interval for cleanup tasks
Parameter |
Default |
Description |
|---|---|---|
|
|
Path to YAML file defining correlation rules. |
|
|
Interval for running correlation cleanup tasks. |
See also
Configuring Fault Correlation for correlation rule syntax and examples.
Complete Example
fault_manager:
ros__parameters:
# Storage
storage_type: "sqlite"
database_path: "/var/lib/ros2_medkit/faults.db"
# Debounce (require 3 FAILED events to confirm)
confirmation_threshold: -3
healing_enabled: true
healing_threshold: 3
auto_confirm_after_sec: 30.0
# Per-entity debounce overrides
entity_thresholds:
config_file: "/etc/ros2_medkit/entity_thresholds.yaml"
# Snapshots
snapshots:
enabled: true
background_capture: true
timeout_sec: 2.0
max_message_size: 131072
default_topics:
- /diagnostics
- /rosout
config_file: "/etc/ros2_medkit/snapshot_config.yaml"
rosbag:
enabled: true
duration_sec: 10.0
duration_after_sec: 2.0
topics: "config"
max_bag_size_mb: 100
max_total_storage_mb: 1000
auto_cleanup: true
# Correlation
correlation:
config_file: "/etc/ros2_medkit/correlation_rules.yaml"
cleanup_interval_sec: 10.0
See Also
Configuring Snapshot Capture - Diagnostic snapshot configuration
Configuring Fault Correlation - Fault correlation rules
Message Definitions - Message definitions (Fault.msg, FaultEvent.msg)
ros2_medkit_fault_manager - FaultManager architecture