ros2_medkit_fault_manager

This section contains design documentation for the ros2_medkit_fault_manager project.

Architecture

The following diagram shows the relationships between the main components of the fault manager.

Main Components

FaultManagerNode - The main ROS 2 node that provides fault management services
- Extends rclcpp::Node
- Owns a FaultStorage implementation for fault state persistence
- Provides three ROS 2 services for fault reporting, querying, and clearing
- Validates input parameters (fault_code, severity, source_id)
- Logs fault lifecycle events at appropriate severity levels

FaultStorage - Abstract interface for fault storage backends
- Defines the contract for fault storage implementations
- Enables pluggable storage backends (in-memory, persistent, distributed)
- Additional backends are planned under Issue #8: Fault Persistence Options

InMemoryFaultStorage - Thread-safe in-memory implementation of FaultStorage
- Uses std::map keyed by fault_code for O(log n) lookups
- Protected by std::mutex for concurrent service request handling
- Aggregates reports from multiple sources into single fault entries
- Implements severity escalation (higher severity overwrites lower)
- Tracks occurrence counts and all reporting sources

FaultState - Internal representation of a fault entry
- Maps directly to ros2_medkit_msgs::msg::Fault via to_msg()
- Uses std::set for reporting_sources to ensure uniqueness
- Tracks first and last occurrence timestamps
- Manages fault status lifecycle with debounce (PREFAILED → CONFIRMED → CLEARED)
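
The ownership relationship between the node and a pluggable storage backend could look roughly like the sketch below. It is not the project's actual interface: StorageSketch, InMemorySketch, ManagerSketch, FaultRecord, and the store/find method names are all hypothetical, chosen only to illustrate an abstract contract with one in-memory backend behind it.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, heavily simplified fault record (the real FaultState has more fields).
struct FaultRecord {
  std::string fault_code;
  std::uint8_t severity = 0;
};

// Hypothetical storage contract; the method names are illustrative only.
class StorageSketch {
public:
  virtual ~StorageSketch() = default;
  virtual void store(const FaultRecord & record) = 0;
  // Returns nullptr when no fault with this code is stored.
  virtual const FaultRecord * find(const std::string & fault_code) const = 0;
};

// One possible backend: a std::map keyed by fault_code (O(log n) lookups).
class InMemorySketch : public StorageSketch {
public:
  void store(const FaultRecord & record) override { faults_[record.fault_code] = record; }

  const FaultRecord * find(const std::string & fault_code) const override {
    auto it = faults_.find(fault_code);
    return it == faults_.end() ? nullptr : &it->second;
  }

private:
  std::map<std::string, FaultRecord> faults_;
};

// Stand-in for the node: it owns the backend only through the abstract interface.
class ManagerSketch {
public:
  explicit ManagerSketch(std::unique_ptr<StorageSketch> storage)
  : storage_(std::move(storage)) {}

  StorageSketch & storage() { return *storage_; }

private:
  std::unique_ptr<StorageSketch> storage_;
};
```

Because the manager depends only on the abstract type, a persistent or distributed backend could be swapped in without touching the service logic.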

Services

~/report_fault

Reports a new fault or updates an existing one.

Input validation: fault_code and source_id cannot be empty, event_type must be valid
Event types: FAILED (fault condition detected) or PASSED (fault condition no longer present)
Debounce: FAILED events decrement the debounce counter; PASSED events increment it
Aggregation: Same fault_code from different sources creates a single fault entry
Severity escalation: Fault severity is updated if a higher severity is reported
Returns: accepted=true if event was processed
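
The input validation rules above can be sketched as a small standalone function. The ReportRequest struct, the numeric event encodings, and the error strings are assumptions made for illustration; the actual service request type and constants live in ros2_medkit_msgs and may differ.

```cpp
#include <cstdint>
#include <string>

// Hypothetical request shape; field names follow the text, values are assumed.
struct ReportRequest {
  std::string fault_code;
  std::string source_id;
  std::uint8_t event_type;
};

constexpr std::uint8_t kEventFailed = 0;  // assumed encoding
constexpr std::uint8_t kEventPassed = 1;  // assumed encoding

// Returns an empty string when the request is valid, otherwise a rejection reason.
std::string validate_report(const ReportRequest & req)
{
  if (req.fault_code.empty()) {
    return "fault_code cannot be empty";
  }
  if (req.source_id.empty()) {
    return "source_id cannot be empty";
  }
  if (req.event_type != kEventFailed && req.event_type != kEventPassed) {
    return "event_type must be FAILED or PASSED";
  }
  return "";
}
```

A service callback built on this would set accepted=true only when the returned string is empty.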

~/list_faults

Queries faults with optional filtering.

Status filter: Filter by status (PREFAILED, PREPASSED, CONFIRMED, HEALED, CLEARED); defaults to CONFIRMED
Severity filter: When filter_by_severity=true, returns only faults of specified severity
Returns: List of Fault messages matching the filter criteria
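
As an illustration of the filtering rules, the following sketch applies a status filter that falls back to CONFIRMED and an optional exact-match severity filter. The FaultEntry and ListQuery types are hypothetical, and accepting several statuses in one query is an assumption; the real request fields may be shaped differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

enum class Status { PREFAILED, PREPASSED, CONFIRMED, HEALED, CLEARED };

// Hypothetical, simplified fault entry.
struct FaultEntry {
  std::string fault_code;
  std::uint8_t severity;
  Status status;
};

// Hypothetical query; an empty status list means "use the default".
struct ListQuery {
  std::vector<Status> statuses;
  bool filter_by_severity = false;
  std::uint8_t severity = 0;
};

std::vector<FaultEntry> list_faults(const std::vector<FaultEntry> & all, const ListQuery & q)
{
  // Default to CONFIRMED when the caller gives no status filter.
  const std::vector<Status> wanted =
    q.statuses.empty() ? std::vector<Status>{Status::CONFIRMED} : q.statuses;

  std::vector<FaultEntry> out;
  for (const auto & fault : all) {
    const bool status_ok =
      std::find(wanted.begin(), wanted.end(), fault.status) != wanted.end();
    // The severity filter applies only when explicitly enabled.
    const bool severity_ok = !q.filter_by_severity || fault.severity == q.severity;
    if (status_ok && severity_ok) {
      out.push_back(fault);
    }
  }
  return out;
}
```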

~/clear_fault

Clears (acknowledges) a fault by setting its status to CLEARED.

Input validation: fault_code cannot be empty
Idempotent: Clearing an already-cleared fault succeeds
Returns: success=true if fault existed, success=false if not found
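
The clearing semantics can be sketched as follows. This is a simplified model, not the project's code: the ClearResponse struct, the message strings, and the use of a plain map from fault_code to a status string are assumptions made to keep the example self-contained.

```cpp
#include <map>
#include <string>

// Hypothetical response shape; only `success` is named in the text.
struct ClearResponse {
  bool success;
  std::string message;
};

ClearResponse clear_fault(
  std::map<std::string, std::string> & status_by_code, const std::string & fault_code)
{
  if (fault_code.empty()) {
    return {false, "fault_code cannot be empty"};
  }
  auto it = status_by_code.find(fault_code);
  if (it == status_by_code.end()) {
    return {false, "fault not found"};
  }
  // Assigning CLEARED again is harmless, which makes the call idempotent.
  it->second = "CLEARED";
  return {true, ""};
}
```

Calling clear_fault twice on the same existing fault returns success=true both times, while an unknown code returns success=false.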

Design Decisions

Thread Safety

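All public methods of the InMemoryFaultStorage implementation acquire a std::mutex before touching the fault map, so concurrent service requests cannot observe or produce a partially updated entry. This is essential because ROS 2 service callbacks may execute on different threads, for example under a multi-threaded executor.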
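
A minimal sketch of this locking pattern is shown below. The LockedStoreSketch class and its record/count methods are hypothetical; the point is only that every public method, including const readers, takes the same mutex before accessing shared state.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

class LockedStoreSketch {
public:
  void record(const std::string & fault_code)
  {
    std::lock_guard<std::mutex> lock(mutex_);  // released automatically at scope exit
    ++counts_[fault_code];
  }

  std::size_t count(const std::string & fault_code) const
  {
    std::lock_guard<std::mutex> lock(mutex_);  // readers lock too
    auto it = counts_.find(fault_code);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  // `mutable` lets const methods lock the mutex.
  mutable std::mutex mutex_;
  std::map<std::string, std::size_t> counts_;
};
```

Using std::lock_guard rather than manual lock/unlock calls keeps the mutex released on every return path, including when an exception is thrown.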

Fault Aggregation

Multiple reports of the same fault_code (from the same or different sources) are aggregated into a single fault entry. This provides:

Deduplication: Prevents fault flooding from repeated reports
Source tracking: Identifies all sources reporting the same fault
Occurrence counting: Tracks how many times a fault was reported
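
The three properties above can be sketched with a map keyed by fault_code whose entries hold a counter and a set of sources. AggregatorSketch and AggregatedFault are hypothetical names, and the real FaultState carries additional fields such as timestamps and status.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical aggregated entry.
struct AggregatedFault {
  std::uint32_t occurrence_count = 0;
  std::set<std::string> reporting_sources;  // std::set drops duplicate sources
};

class AggregatorSketch {
public:
  void report(const std::string & fault_code, const std::string & source_id)
  {
    // operator[] creates the entry on the first report and reuses it afterwards.
    AggregatedFault & entry = faults_[fault_code];
    ++entry.occurrence_count;
    entry.reporting_sources.insert(source_id);
  }

  std::size_t fault_count() const { return faults_.size(); }

  // Throws std::out_of_range if the fault_code was never reported.
  const AggregatedFault & at(const std::string & fault_code) const
  {
    return faults_.at(fault_code);
  }

private:
  std::map<std::string, AggregatedFault> faults_;
};
```

Three reports of one fault_code from two distinct sources therefore yield one entry with an occurrence count of 3 and two reporting sources.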

Severity Escalation

When a fault is re-reported with a higher severity, the stored severity is updated. This ensures the fault reflects the worst-case condition. Severity levels are ordered: INFO(0) < WARN(1) < ERROR(2) < CRITICAL(3).
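
Given the ordering above, escalation reduces to keeping the maximum of the stored and reported severities. The Severity enum and escalate function below are an illustrative sketch; the numeric values come from the text, while the type and function names are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Numeric values follow the ordering given in the text.
enum class Severity : std::uint8_t { INFO = 0, WARN = 1, ERROR = 2, CRITICAL = 3 };

// Returns the severity to store after a new report: never lower than before.
inline Severity escalate(Severity stored, Severity reported)
{
  return std::max(stored, reported);
}
```

For example, escalate(Severity::WARN, Severity::ERROR) yields ERROR, while escalate(Severity::CRITICAL, Severity::INFO) stays at CRITICAL.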

Status Lifecycle (Debounce Model)

Faults follow an AUTOSAR DEM-style debounce lifecycle:

PREFAILED: Debounce counter < 0 but above confirmation threshold (fault trending towards confirmation)
PREPASSED: Debounce counter > 0 but below healing threshold (fault trending towards healing)
CONFIRMED: Debounce counter <= confirmation threshold (e.g., -3). Fault is active and verified.
HEALED: Debounce counter >= healing threshold (if healing enabled). Fault resolved by PASSED events.
CLEARED: Fault manually acknowledged via ClearFault service

FAILED events decrement the debounce counter (towards confirmation). PASSED events increment the debounce counter (towards healing). CRITICAL severity bypasses debounce and confirms immediately.
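
The counter rules can be read as a small state machine, sketched below under several assumptions: the healing threshold of +3 and the healing-enabled default are invented for the example (only -3 appears in the text), a counter of exactly 0 keeps the previous status because the text does not define that case, and CRITICAL is modeled by jumping the counter to the confirmation threshold. The status is derived purely from the counter, which is a literal reading of the definitions above and may not match how the real implementation latches CONFIRMED. Manual clearing is not modeled.

```cpp
#include <algorithm>

enum class Status { PREFAILED, PREPASSED, CONFIRMED, HEALED, CLEARED };
enum class Event { FAILED, PASSED };

struct DebounceConfig {
  int confirmation_threshold = -3;  // example value from the text
  int healing_threshold = 3;        // assumed value
  bool healing_enabled = true;      // assumed default
};

struct DebouncerSketch {
  DebounceConfig cfg;
  int counter = 0;
  Status status = Status::PREFAILED;

  Status on_event(Event event, bool critical = false)
  {
    if (event == Event::FAILED) {
      --counter;
      if (critical) {
        // CRITICAL bypasses debounce: force the counter to the threshold (assumed).
        counter = std::min(counter, cfg.confirmation_threshold);
      }
    } else {
      ++counter;
    }

    if (counter <= cfg.confirmation_threshold) {
      status = Status::CONFIRMED;
    } else if (cfg.healing_enabled && counter >= cfg.healing_threshold) {
      status = Status::HEALED;
    } else if (counter < 0) {
      status = Status::PREFAILED;
    } else if (counter > 0) {
      status = Status::PREPASSED;
    }
    // counter == 0 is unspecified in the text; this sketch keeps the previous status.
    return status;
  }
};
```

With the default configuration, three consecutive FAILED events move a new fault through PREFAILED to CONFIRMED, while a single FAILED event flagged as critical confirms it immediately.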