ros2_medkit_fault_manager
This section contains design documentation for the ros2_medkit_fault_manager project.
Architecture
The following diagram shows the relationships between the main components of the fault manager.
ROS 2 Medkit Fault Manager Class Architecture
Main Components
FaultManagerNode - The main ROS 2 node that provides fault management services
- Extends rclcpp::Node
- Owns a FaultStorage implementation for fault state persistence
- Provides three ROS 2 services for fault reporting, querying, and clearing
- Validates input parameters (fault_code, severity, source_id)
- Logs fault lifecycle events at appropriate severity levels
FaultStorage - Abstract interface for fault storage backends
- Defines the contract for fault storage implementations
- Enables pluggable storage backends (in-memory, persistent, distributed)
- Future implementations can be added in Issue #8: Fault Persistence Options
InMemoryFaultStorage - Thread-safe in-memory implementation of FaultStorage
- Uses std::map keyed by fault_code for O(log n) lookups
- Protected by std::mutex for concurrent service request handling
- Aggregates reports from multiple sources into single fault entries
- Implements severity escalation (higher severity overwrites lower)
- Tracks occurrence counts and all reporting sources
FaultState - Internal representation of a fault entry
- Maps directly to ros2_medkit_msgs::msg::Fault via to_msg()
- Uses std::set for reporting_sources to ensure uniqueness
- Tracks first and last occurrence timestamps
- Manages fault status lifecycle (PENDING → CONFIRMED → CLEARED)
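The ownership relationship between the node and its storage backend can be illustrated with a minimal, ROS-free sketch. The method names (store, contains, size) and the FaultManager stand-in class are hypothetical and chosen only for illustration; the real interface may look different.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Abstract storage contract. The method names are hypothetical.
class FaultStorage {
public:
  virtual ~FaultStorage() = default;
  virtual void store(const std::string& fault_code) = 0;
  virtual bool contains(const std::string& fault_code) const = 0;
  virtual std::size_t size() const = 0;
};

// In-memory backend keyed by fault_code (locking omitted for brevity).
class InMemoryFaultStorage : public FaultStorage {
public:
  void store(const std::string& fault_code) override { ++faults_[fault_code]; }
  bool contains(const std::string& fault_code) const override {
    return faults_.count(fault_code) > 0;
  }
  std::size_t size() const override { return faults_.size(); }

private:
  std::map<std::string, int> faults_;  // value: number of reports received
};

// Hypothetical stand-in for the node: it depends only on the interface,
// so another backend can be swapped in without touching this class.
class FaultManager {
public:
  explicit FaultManager(std::unique_ptr<FaultStorage> storage)
      : storage_(std::move(storage)) {}
  FaultStorage& storage() { return *storage_; }

private:
  std::unique_ptr<FaultStorage> storage_;
};
```

Constructing the owner with std::make_unique<InMemoryFaultStorage>() selects the in-memory backend; a persistent backend would only need to implement the same interface.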
Services
~/report_fault
Reports a new fault or updates an existing one.
Input validation: fault_code and source_id cannot be empty, severity must be 0-3
Aggregation: Same fault_code from different sources creates a single fault entry
Severity escalation: Fault severity is updated if a higher severity is reported
Returns: success=true with a message indicating “New fault” or “Fault updated”
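The input validation rules listed above could be expressed roughly as follows. This is a sketch only: ValidationResult, validate_report, kMaxSeverity, and the error message strings are hypothetical and not taken from the project.

```cpp
#include <cstdint>
#include <string>

// Hypothetical result type: a success flag plus a human-readable message.
struct ValidationResult {
  bool success;
  std::string message;
};

constexpr std::uint8_t kMaxSeverity = 3;  // highest valid severity level

// Applies the three documented checks in order and reports the first failure.
ValidationResult validate_report(const std::string& fault_code,
                                 std::uint8_t severity,
                                 const std::string& source_id) {
  if (fault_code.empty()) return {false, "fault_code cannot be empty"};
  if (source_id.empty()) return {false, "source_id cannot be empty"};
  if (severity > kMaxSeverity) return {false, "severity must be 0-3"};
  return {true, ""};
}
```

Rejecting invalid requests before they reach the storage layer keeps the storage code free of input checks.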
~/get_faults
Queries faults with optional filtering.
Status filter: Filter by status (PENDING, CONFIRMED, CLEARED); defaults to CONFIRMED
Severity filter: When filter_by_severity=true, returns only faults of the specified severity
Returns: List of Fault messages matching the filter criteria
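The filter semantics can be sketched as below. The Fault and GetFaultsFilter structs are simplified stand-ins, and modelling the status filter as a single enum value is an assumption; only filter_by_severity and the CONFIRMED default come from the description above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class Status { PENDING, CONFIRMED, CLEARED };

// Simplified stand-in for the Fault message.
struct Fault {
  std::string fault_code;
  std::uint8_t severity;
  Status status;
};

// Hypothetical request shape; the status filter defaults to CONFIRMED.
struct GetFaultsFilter {
  Status status = Status::CONFIRMED;
  bool filter_by_severity = false;
  std::uint8_t severity = 0;
};

// Returns the faults whose status matches and, if requested, whose
// severity equals the requested severity exactly.
std::vector<Fault> get_faults(const std::vector<Fault>& all,
                              const GetFaultsFilter& filter) {
  std::vector<Fault> out;
  for (const Fault& fault : all) {
    if (fault.status != filter.status) continue;
    if (filter.filter_by_severity && fault.severity != filter.severity) continue;
    out.push_back(fault);
  }
  return out;
}
```

A default-constructed filter therefore returns every CONFIRMED fault regardless of severity.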
~/clear_fault
Clears (acknowledges) a fault by setting its status to CLEARED.
Input validation: fault_code cannot be empty
Idempotent: Clearing an already-cleared fault succeeds
Returns: success=true if the fault existed, success=false if not found
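A minimal sketch of these clearing semantics, assuming faults are held in a map from fault_code to status. ClearResult, clear_fault, and the message strings are hypothetical names used only for illustration.

```cpp
#include <map>
#include <string>

enum class Status { PENDING, CONFIRMED, CLEARED };

// Hypothetical response shape: success flag plus message.
struct ClearResult {
  bool success;
  std::string message;
};

ClearResult clear_fault(std::map<std::string, Status>& faults,
                        const std::string& fault_code) {
  if (fault_code.empty()) return {false, "fault_code cannot be empty"};
  auto it = faults.find(fault_code);
  if (it == faults.end()) return {false, "Fault not found"};
  // Assigning CLEARED to an already-cleared fault changes nothing,
  // which is what makes the operation idempotent.
  it->second = Status::CLEARED;
  return {true, "Fault cleared"};
}
```

Calling clear_fault twice for the same existing fault succeeds both times and leaves the fault in the CLEARED state.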
Design Decisions
Thread Safety
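All public methods of the FaultStorage implementation (InMemoryFaultStorage) acquire
a mutex lock to ensure thread safety when handling concurrent service requests. This
is essential since ROS 2 service callbacks may execute on different threads, for
example when the node runs under a multi-threaded executor.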
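The locking pattern can be sketched with std::lock_guard, using plain threads to stand in for concurrent service callbacks. CountingStorage and hammer are hypothetical names; the sketch stores only an occurrence count per fault_code.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class CountingStorage {
public:
  void report(const std::string& fault_code) {
    std::lock_guard<std::mutex> lock(mutex_);  // released when lock leaves scope
    ++counts_[fault_code];
  }

  std::uint32_t count(const std::string& fault_code) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counts_.find(fault_code);
    return it == counts_.end() ? 0u : it->second;
  }

private:
  mutable std::mutex mutex_;  // mutable so const readers can lock too
  std::map<std::string, std::uint32_t> counts_;
};

// Simulates concurrent callbacks: each thread reports the same fault
// reports_per_thread times; returns the final count after all joins.
std::uint32_t hammer(CountingStorage& storage, int threads, int reports_per_thread) {
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&storage, reports_per_thread] {
      for (int i = 0; i < reports_per_thread; ++i) {
        storage.report("SENSOR_TIMEOUT");
      }
    });
  }
  for (std::thread& worker : workers) worker.join();
  return storage.count("SENSOR_TIMEOUT");
}
```

Because every access to the map happens under the mutex, no increment is lost: the final count equals threads multiplied by reports_per_thread.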
Fault Aggregation
Multiple reports of the same fault_code (from the same or different sources) are
aggregated into a single fault entry. This provides:
Deduplication: Prevents fault flooding from repeated reports
Source tracking: Identifies all sources reporting the same fault
Occurrence counting: Tracks how many times a fault was reported
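The aggregation behaviour can be sketched as follows. FaultEntry and aggregate_report are hypothetical names; the sketch covers only source tracking and occurrence counting.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Simplified fault entry: only the fields relevant to aggregation.
struct FaultEntry {
  std::uint32_t occurrence_count = 0;
  std::set<std::string> reporting_sources;  // a set stores each source once
};

// Records one report. Returns true if a new entry was created and
// false if an existing entry was updated.
bool aggregate_report(std::map<std::string, FaultEntry>& faults,
                      const std::string& fault_code,
                      const std::string& source_id) {
  const bool is_new = faults.find(fault_code) == faults.end();
  FaultEntry& entry = faults[fault_code];  // creates the entry if missing
  ++entry.occurrence_count;
  entry.reporting_sources.insert(source_id);
  return is_new;
}
```

Three reports of one fault_code from two distinct sources leave a single entry with an occurrence count of 3 and two reporting sources.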
Severity Escalation
When a fault is re-reported with a higher severity, the stored severity is updated.
This ensures the fault reflects the worst-case condition. Severity levels are ordered:
INFO(0) < WARN(1) < ERROR(2) < CRITICAL(3).
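Since the severity levels are ordered integers, escalation reduces to keeping the larger of the stored and reported values. The constant names and the escalate function below are hypothetical; only the numeric values come from the ordering above.

```cpp
#include <cstdint>

// Numeric values follow the documented ordering; the names are hypothetical.
constexpr std::uint8_t SEVERITY_INFO = 0;
constexpr std::uint8_t SEVERITY_WARN = 1;
constexpr std::uint8_t SEVERITY_ERROR = 2;
constexpr std::uint8_t SEVERITY_CRITICAL = 3;

// Returns the severity to store: the stored value is replaced only
// when the reported value is strictly higher.
std::uint8_t escalate(std::uint8_t stored, std::uint8_t reported) {
  return reported > stored ? reported : stored;
}
```

A later report with a lower severity leaves the stored value unchanged, so the entry never silently de-escalates.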
Status Lifecycle
Faults follow a lifecycle: PENDING → CONFIRMED → CLEARED
PENDING: Initial status when a fault is first reported
CONFIRMED: Status after automatic or manual confirmation (Issue #6)
CLEARED: Status after the fault is cleared/acknowledged
Currently, faults start as PENDING and move to CLEARED when explicitly cleared. Automatic PENDING → CONFIRMED transitions will be implemented in Issue #6.
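The lifecycle can be read as a small transition table. The sketch below is an interpretation, not the project's code: is_allowed_transition is a hypothetical helper, PENDING → CONFIRMED is included even though it is still planned, and CLEARED is treated as terminal because the text above does not say whether a cleared fault can be reactivated.

```cpp
enum class Status { PENDING, CONFIRMED, CLEARED };

// Returns true if moving from `from` to `to` is consistent with the
// documented lifecycle. Same-state "transitions" are not considered here.
bool is_allowed_transition(Status from, Status to) {
  switch (from) {
    case Status::PENDING:
      // PENDING -> CLEARED is the current behaviour;
      // PENDING -> CONFIRMED is the planned confirmation step.
      return to == Status::CONFIRMED || to == Status::CLEARED;
    case Status::CONFIRMED:
      return to == Status::CLEARED;
    case Status::CLEARED:
      return false;  // assumed terminal in this sketch
  }
  return false;
}
```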