Configuring Snapshot Capture
This tutorial shows how to configure snapshot capture to automatically preserve topic data when faults are confirmed, enabling post-mortem debugging.
Overview
When a fault transitions to CONFIRMED status, the system can automatically capture data from configured ROS 2 topics. This snapshot preserves the system state at the moment of fault occurrence, similar to:
AUTOSAR DEM freeze frames - diagnostic data captured at fault detection
SOVD environment data - system context for fault analysis
Snapshots are useful for:
Debugging intermittent faults that are hard to reproduce
Understanding system state when a fault occurred
Post-mortem analysis without real-time access to the robot
Note
Snapshots are automatically deleted when a fault is cleared via the
DELETE /api/v1/faults/{code} endpoint or ~/clear_fault service.
Quick Start
Start the fault manager with snapshot capture enabled:
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
-p snapshots.enabled:=true \
-p snapshots.default_topics:="['/odom', '/battery_state']"
Start the gateway:
ros2 launch ros2_medkit_gateway gateway.launch.py
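When a fault is confirmed, query the fault itself; its snapshots are included inline under environment_data (the example assumes the fault was reported by the motor_controller app):
curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT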
Configuration Options
Configure snapshot capture via fault manager parameters:
| Parameter | Default | Description |
|---|---|---|
| snapshots.enabled | true | Enable/disable snapshot capture |
| snapshots.default_topics | | Topics to capture for all faults |
| snapshots.config_file | | Path to YAML config for fault-specific topics |
| snapshots.timeout_sec | | Timeout waiting for topic message |
| snapshots.max_message_size | | Maximum message size in bytes (larger messages skipped) |
| | | Use background subscriptions (caches latest message) |
Advanced Configuration
For fault-specific topic capture, create a YAML configuration file:
# snapshots.yaml
fault_specific:
  MOTOR_OVERHEAT:
    - /joint_states
    - /motor/temperature
  BATTERY_LOW:
    - /battery_state
    - /power_management/status
patterns:
  "MOTOR_.*":
    - /joint_states
    - /cmd_vel
  "SENSOR_.*":
    - /diagnostics
Topic Resolution Priority:
1. fault_specific - Exact match for fault code
2. patterns - Regex pattern match (first matching pattern wins)
3. default_topics - Fallback for all faults
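The resolution order can be sketched in a few lines of Python. This is a minimal illustration, not the fault manager's actual code: `resolve_topics` is a hypothetical helper, the dictionaries mirror the example snapshots.yaml, and `DEFAULT_TOPICS` stands in for the snapshots.default_topics parameter. Whether patterns must match the whole fault code is an assumption here (`re.fullmatch`).

```python
import re

# Mirrors the snapshots.yaml example; values are illustrative.
FAULT_SPECIFIC = {
    "MOTOR_OVERHEAT": ["/joint_states", "/motor/temperature"],
    "BATTERY_LOW": ["/battery_state", "/power_management/status"],
}
# Dicts keep insertion order, so "first matching pattern wins" is well defined.
PATTERNS = {
    "MOTOR_.*": ["/joint_states", "/cmd_vel"],
    "SENSOR_.*": ["/diagnostics"],
}
DEFAULT_TOPICS = ["/diagnostics"]  # stands in for snapshots.default_topics


def resolve_topics(fault_code):
    """Return the topics to capture for fault_code (hypothetical helper)."""
    # 1. Exact match on the fault code.
    if fault_code in FAULT_SPECIFIC:
        return FAULT_SPECIFIC[fault_code]
    # 2. First regex pattern that matches (assumption: full-string match).
    for pattern, topics in PATTERNS.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Fallback for all other faults.
    return DEFAULT_TOPICS


print(resolve_topics("MOTOR_OVERHEAT"))  # exact match beats the MOTOR_.* pattern
print(resolve_topics("MOTOR_STALL"))     # falls through to the MOTOR_.* pattern
print(resolve_topics("NAV_ERROR"))       # no match, uses the default topics
```

Note that MOTOR_OVERHEAT matches both an exact entry and the MOTOR_.* pattern; under the stated priority only the exact entry's topics are used.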
Launch with config file:
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
-p snapshots.enabled:=true \
-p snapshots.config_file:=/path/to/snapshots.yaml \
-p snapshots.default_topics:="['/diagnostics']"
Querying Snapshots
Snapshots are included inline in the fault response as environment_data:
Get fault details with snapshots:
curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT
Response:
{
  "item": {
    "code": "MOTOR_OVERHEAT",
    "fault_name": "Motor temperature exceeded threshold",
    "severity": 2,
    "status": {
      "aggregatedStatus": "active",
      "testFailed": "1",
      "confirmedDTC": "1"
    }
  },
  "environment_data": {
    "extended_data_records": {
      "first_occurrence": "2026-02-04T10:30:00.000Z",
      "last_occurrence": "2026-02-04T10:35:00.000Z"
    },
    "snapshots": [
      {
        "type": "freeze_frame",
        "name": "motor_temperature",
        "data": 85.5,
        "x-medkit": {
          "topic": "/motor/temperature",
          "message_type": "sensor_msgs/msg/Temperature",
          "full_data": {"temperature": 85.5, "variance": 0.1},
          "captured_at": "2026-02-04T10:30:00.123Z"
        }
      },
      {
        "type": "rosbag",
        "name": "fault_recording",
        "bulk_data_uri": "/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000",
        "size_bytes": 1234567,
        "duration_sec": 6.0,
        "format": "mcap"
      }
    ]
  },
  "x-medkit": {
    "occurrence_count": 3,
    "reporting_sources": ["/powertrain/motor_controller"]
  }
}
Snapshot Types:
freeze_frame: Topic data captured at fault confirmation (JSON format)
rosbag: Recording file available via bulk-data endpoint (binary format)
Get snapshots from fault response using jq:
curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT | \
jq '.environment_data.snapshots'
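For scripted analysis, the same extraction can be done in Python. The sketch below assumes the response shape shown in the example above; `split_snapshots` is a hypothetical helper, and preferring `x-medkit.full_data` over the top-level `data` value is a choice made for illustration.

```python
import json


def split_snapshots(fault_response):
    """Group environment_data.snapshots by their "type" field (hypothetical helper)."""
    snapshots = fault_response.get("environment_data", {}).get("snapshots", [])
    freeze_frames = {}
    rosbag_uris = []
    for snap in snapshots:
        if snap.get("type") == "freeze_frame":
            # Prefer the full message from the x-medkit extension when present.
            extension = snap.get("x-medkit", {})
            freeze_frames[snap["name"]] = extension.get("full_data", snap.get("data"))
        elif snap.get("type") == "rosbag":
            rosbag_uris.append(snap["bulk_data_uri"])
    return freeze_frames, rosbag_uris


# Trimmed copy of the example fault response.
example = json.loads("""
{
  "environment_data": {
    "snapshots": [
      {"type": "freeze_frame", "name": "motor_temperature", "data": 85.5,
       "x-medkit": {"topic": "/motor/temperature",
                    "full_data": {"temperature": 85.5, "variance": 0.1}}},
      {"type": "rosbag", "name": "fault_recording",
       "bulk_data_uri": "/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000"}
    ]
  }
}
""")

frames, bags = split_snapshots(example)
print(frames)
print(bags)
```

A response without environment_data (for example, after the fault has been cleared) yields an empty dictionary and an empty list rather than raising an error.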
Example Workflow
This example demonstrates the complete snapshot capture workflow.
1. Configure and start the fault manager:
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
-p snapshots.enabled:=true \
-p snapshots.default_topics:="['/odom']"
2. Start a node that publishes odometry:
ros2 topic pub /odom nav_msgs/msg/Odometry \
"{pose: {pose: {position: {x: 1.5, y: 2.0}}}}" -r 10
3. Report a fault (it will be confirmed immediately by default):
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'NAV_ERROR', event_type: 0, severity: 2, \
description: 'Navigation failed', source_id: '/nav_node'}"
4. Query the captured snapshot:
curl http://localhost:8080/api/v1/apps/nav_node/faults/NAV_ERROR | \
jq '.environment_data.snapshots'
The response will contain the odometry data that was captured at the moment the fault was confirmed.
Troubleshooting
No snapshots captured
Verify snapshots.enabled is true
Check that configured topics exist and are publishing
Increase snapshots.timeout_sec for slow-publishing topics
Check fault manager logs for capture errors
Empty snapshots array in response
The fault may have been cleared (snapshots are deleted on clear)
No topics were configured for this fault code
All configured topics timed out or exceeded size limit
Snapshot data truncated
Message exceeded snapshots.max_message_size
Increase the limit or filter to smaller topics
Wrong topics captured
Check topic resolution priority (fault_specific > patterns > default)
Verify regex patterns in config file are correct
Rosbag Capture (Time-Window Recording)
In addition to JSON snapshots, you can enable rosbag capture for “black box” style recording. This continuously buffers messages in memory and flushes them to a bag file when a fault is confirmed.
Key differences from JSON snapshots:
| Feature | JSON Snapshots | Rosbag Capture |
|---|---|---|
| Data format | JSON (human-readable) | Binary (native ROS 2) |
| Time coverage | Point-in-time (at confirmation) | Time window (before + after fault) |
| Message fidelity | Converted to JSON | Original serialization preserved |
| Playback | N/A | ros2 bag play |
| Default | Enabled | Disabled |
Enabling Rosbag Capture
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
-p snapshots.rosbag.enabled:=true \
-p snapshots.rosbag.duration_sec:=5.0 \
-p snapshots.rosbag.duration_after_sec:=1.0
This captures 5 seconds of data before the fault and 1 second after.
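The before/after window can be pictured as a time-bounded ring buffer. The following is a conceptual sketch only, not the fault manager's implementation: `TimeWindowBuffer` and its methods are hypothetical names, and timestamps are plain floats in seconds.

```python
from collections import deque


class TimeWindowBuffer:
    """Illustrative time-bounded ring buffer (not the fault manager's code)."""

    def __init__(self, duration_sec, duration_after_sec):
        self.duration_sec = duration_sec
        self.duration_after_sec = duration_after_sec
        self._messages = deque()  # (timestamp, payload) pairs, oldest first
        self._confirmed_at = None

    def _trim(self, now):
        # Drop everything older than duration_sec relative to "now".
        while self._messages and self._messages[0][0] < now - self.duration_sec:
            self._messages.popleft()

    def add(self, stamp, payload):
        """Buffer a message; returns False once the post-fault window has closed."""
        if self._confirmed_at is not None:
            if stamp > self._confirmed_at + self.duration_after_sec:
                return False
            self._messages.append((stamp, payload))
            return True
        self._messages.append((stamp, payload))
        self._trim(stamp)
        return True

    def confirm(self, stamp):
        """Mark the fault as confirmed; the pre-fault history is frozen from here on."""
        self._confirmed_at = stamp
        self._trim(stamp)

    def flush(self):
        """Return the buffered window, as it would be written to a bag file."""
        return list(self._messages)


buf = TimeWindowBuffer(duration_sec=5.0, duration_after_sec=1.0)
for t in range(0, 10):          # messages at t = 0..9 s
    buf.add(float(t), f"msg@{t}")
buf.confirm(10.0)               # fault confirmed at t = 10 s
for t in (10.5, 11.0, 11.5):    # 11.5 s lies outside the 1 s post-fault window
    buf.add(t, f"msg@{t}")
stamps = [stamp for stamp, _ in buf.flush()]
print(stamps)  # [5.0, 6.0, 7.0, 8.0, 9.0, 10.5, 11.0]
```

With these settings the flushed window spans from 5 seconds before confirmation to 1 second after it, which is why the example response reports a 6.0 second recording.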
Rosbag Configuration Options
| Parameter | Default | Description |
|---|---|---|
| snapshots.rosbag.enabled | false | Enable rosbag capture. When enabled, the system continuously buffers messages in memory and writes them to a bag file when faults are confirmed. |
| snapshots.rosbag.duration_sec | | Ring buffer duration in seconds. This determines how much history is preserved before the fault confirmation. Larger values provide more context but consume more memory. |
| snapshots.rosbag.duration_after_sec | | Post-fault recording duration. After a fault is confirmed, recording continues for this many seconds to capture immediate system response. |
| snapshots.rosbag.topics | | Topic selection mode: |
| | | Explicit list of topics to record (only used when |
| snapshots.rosbag.exclude_topics | | Topics to exclude from recording (applies to all modes). Useful for filtering high-bandwidth topics like camera images. |
| snapshots.rosbag.format | "sqlite3" | Bag storage format: |
| snapshots.rosbag.storage_path | | Directory for bag files. Empty string uses system temp directory ( |
| snapshots.rosbag.auto_cleanup | | Automatically delete bag files when faults are cleared. Set to |
| snapshots.rosbag.lazy_start | false | Controls when the ring buffer starts recording. See diagram below. |
| snapshots.rosbag.max_bag_size_mb | | Maximum size per bag file in MB. When exceeded, rosbag2 creates additional segment files. |
| snapshots.rosbag.max_total_storage_mb | | Total storage limit for all bag files. Oldest bags are automatically deleted when this limit is exceeded. |
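The total-storage limit amounts to an oldest-first eviction policy. The sketch below illustrates that policy under stated assumptions: `bags_to_delete` is a hypothetical helper, bags are described by made-up (name, size_mb, created_at) tuples, and the real node may differ in when and how it evaluates the limit.

```python
def bags_to_delete(bags, max_total_storage_mb):
    """Pick the oldest bags to delete until the total size fits the limit.

    bags: list of (name, size_mb, created_at) tuples (hypothetical shape).
    """
    total = sum(size for _, size, _ in bags)
    doomed = []
    for name, size, _ in sorted(bags, key=lambda bag: bag[2]):  # oldest first
        if total <= max_total_storage_mb:
            break
        doomed.append(name)
        total -= size
    return doomed


bags = [
    ("fault_a", 80, 100.0),
    ("fault_b", 90, 200.0),
    ("fault_c", 70, 300.0),
]
print(bags_to_delete(bags, max_total_storage_mb=200))  # ['fault_a']
```

Here the three bags total 240 MB, so removing the oldest one (80 MB) is enough to get back under a 200 MB limit.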
Understanding lazy_start Mode
The lazy_start parameter controls when the ring buffer starts recording:
lazy_start: false (default) - Recording starts immediately at node startup. Best for development and when you need maximum context for any fault.
lazy_start: true - Recording only starts when a fault enters the PREFAILED state. This saves resources but may miss context if the fault is confirmed before the buffer fills.
When to use lazy_start: true:
Production systems with limited resources
When faults have reliable PREFAILED → CONFIRMED progression
Systems where most faults are debounced (enter PREFAILED first)
When to use lazy_start: false:
Development and debugging
When faults may skip the PREFAILED state (such as severity 3, CRITICAL)
When maximum fault context is more important than resource usage
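The trade-off can be made concrete with a little arithmetic: the pre-fault history in a bag is limited both by the ring buffer length and by how long recording had been running when the fault was confirmed. The helper below is a hypothetical illustration; it assumes that with lazy_start: true recording begins at the PREFAILED transition, and that a fault which skips PREFAILED starts recording only at confirmation.

```python
def pre_fault_context_sec(duration_sec, recording_started_at, confirmed_at):
    """Seconds of history available before confirmation (illustrative arithmetic)."""
    elapsed = max(0.0, confirmed_at - recording_started_at)
    return min(duration_sec, elapsed)


# lazy_start: false -> recording began at node startup (t = 0 s here).
always_on = pre_fault_context_sec(5.0, recording_started_at=0.0, confirmed_at=120.0)
# lazy_start: true -> recording began at PREFAILED, 2 s before confirmation.
lazy = pre_fault_context_sec(5.0, recording_started_at=118.0, confirmed_at=120.0)
# lazy_start: true and the fault skipped PREFAILED -> no history (assumption).
skipped = pre_fault_context_sec(5.0, recording_started_at=120.0, confirmed_at=120.0)
print(always_on, lazy, skipped)  # 5.0 2.0 0.0
```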
Note
The "mcap" format requires rosbag2_storage_mcap to be installed.
If not available, use "sqlite3" (default).
# Install MCAP support (optional)
sudo apt install ros-${ROS_DISTRO}-rosbag2-storage-mcap
Downloading Rosbag Files
Rosbag files are downloaded via SOVD bulk-data endpoints.
1. List available rosbags for an entity:
curl http://localhost:8080/api/v1/apps/motor_controller/bulk-data/rosbags
Response:
{
  "items": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "MOTOR_OVERHEAT recording",
      "mimetype": "application/x-mcap",
      "size": 1234567,
      "creation_date": "2026-02-04T10:30:00.000Z",
      "x-medkit": {
        "fault_code": "MOTOR_OVERHEAT",
        "duration_sec": 6.0,
        "format": "mcap"
      }
    }
  ]
}
2. Download a specific rosbag:
Use the bulk_data_uri from the fault response (appended to the /api/v1 base path), or construct the URL from an id in the listing:
# Using bulk_data_uri from fault response
curl -O -J http://localhost:8080/api/v1/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000
The -J flag, combined with -O, saves the file under the server-provided filename from the Content-Disposition header.
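Both ways of arriving at the download URL can be sketched in Python. The helper names below are hypothetical, and the sketch assumes that bulk_data_uri is relative to the /api/v1 root, as the example fault response and the curl command above suggest.

```python
from urllib.parse import quote

API_ROOT = "http://localhost:8080/api/v1"  # gateway address used in this tutorial


def url_from_bulk_data_uri(bulk_data_uri, api_root=API_ROOT):
    """Prefix the relative bulk_data_uri from a fault response with the API root."""
    return api_root.rstrip("/") + "/" + bulk_data_uri.lstrip("/")


def url_from_listing(entity_path, item, api_root=API_ROOT):
    """Build the download URL from a listing item.

    entity_path is the entity part of the URL, e.g. "apps/motor_controller".
    """
    root = api_root.rstrip("/")
    entity = entity_path.strip("/")
    return f"{root}/{entity}/bulk-data/rosbags/{quote(item['id'])}"


uri = "/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000"
item = {"id": "550e8400-e29b-41d4-a716-446655440000"}

print(url_from_bulk_data_uri(uri))
print(url_from_listing("apps/motor_controller", item))
```

Both calls produce the same URL that the curl command downloads from.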
3. Play back the rosbag:
ros2 bag play MOTOR_OVERHEAT.mcap
Via ROS 2 service (alternative):
ros2 service call /fault_manager/get_rosbag ros2_medkit_msgs/srv/GetRosbag \
"{fault_code: 'MOTOR_OVERHEAT'}"
Example: Production Configuration
For production use with conservative resource usage:
# config/snapshots.yaml
rosbag:
  enabled: true
  duration_sec: 3.0
  duration_after_sec: 0.5
  topics: "config"        # Use same topics as JSON snapshots
  lazy_start: true        # Save resources until fault detected
  format: "sqlite3"
  max_bag_size_mb: 25
  max_total_storage_mb: 200
  auto_cleanup: true
  # Exclude high-bandwidth topics
  # exclude_topics:
  #   - /camera/image_raw
  #   - /pointcloud
Example: Debugging Configuration
For development with maximum context:
rosbag:
  enabled: true
  duration_sec: 10.0        # 10 seconds before fault
  duration_after_sec: 2.0   # 2 seconds after
  topics: "config"
  lazy_start: false         # Always recording
  format: "sqlite3"
  storage_path: "/var/log/ros2_medkit/rosbags"
  max_bag_size_mb: 100
  max_total_storage_mb: 1000
  auto_cleanup: false       # Keep bags for analysis
See Also
REST API Reference - REST API reference (Bulk Data section)
Faults - Fault API requirements
Gateway README - REST API reference
config/snapshots.yaml - Full configuration reference
Migration from Legacy Endpoints
If you were using the legacy snapshot endpoints, migrate to the new SOVD-compliant API:
Snapshots:
| Previous (removed) | Current |
|---|---|
| | |
| | |
Rosbag Downloads:
| Previous (removed) | Current |
|---|---|
| | |
| | |
Key Changes:
Snapshots inline: No separate snapshot endpoint; data is in fault response
Bulk-data pattern: Rosbags use SOVD bulk-data with UUID identifiers
Entity-scoped: Bulk-data endpoints require entity path (e.g., /apps/motor)
SOVD status: Fault response includes SOVD-compliant status object