Configuring Snapshot Capture

This tutorial shows how to configure snapshot capture to automatically preserve topic data when faults are confirmed, enabling post-mortem debugging.

Overview

When a fault transitions to CONFIRMED status, the system can automatically capture data from configured ROS 2 topics. This snapshot preserves the system state at the moment of fault occurrence, similar to:

  • AUTOSAR DEM freeze frames - diagnostic data captured at fault detection

  • SOVD environment data - system context for fault analysis

Snapshots are useful for:

  • Debugging intermittent faults that are hard to reproduce

  • Understanding system state when a fault occurred

  • Post-mortem analysis without real-time access to the robot

Note

Snapshots are automatically deleted when a fault is cleared via the DELETE /api/v1/faults/{code} endpoint or ~/clear_fault service.

Quick Start

  1. Start the fault manager with snapshot capture enabled:

    ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
      -p snapshots.enabled:=true \
      -p snapshots.default_topics:="['/odom', '/battery_state']"
    
  2. Start the gateway:

    ros2 launch ros2_medkit_gateway gateway.launch.py
    
  3. When a fault is confirmed, query its snapshots:

    curl http://localhost:8080/api/v1/faults/MOTOR_OVERHEAT/snapshots
    

Configuration Options

Configure snapshot capture via fault manager parameters:

Parameter

Default

Description

snapshots.enabled

true

Enable/disable snapshot capture

snapshots.default_topics

[]

Topics to capture for all faults

snapshots.config_file

""

Path to YAML config for fault-specific topics

snapshots.timeout_sec

1.0

Timeout waiting for topic message

snapshots.max_message_size

65536

Maximum message size in bytes (larger messages skipped)

snapshots.background_capture

false

Use background subscriptions (caches latest message)

Advanced Configuration

For fault-specific topic capture, create a YAML configuration file:

# snapshots.yaml
fault_specific:
  MOTOR_OVERHEAT:
    - /joint_states
    - /motor/temperature
  BATTERY_LOW:
    - /battery_state
    - /power_management/status

patterns:
  "MOTOR_.*":
    - /joint_states
    - /cmd_vel
  "SENSOR_.*":
    - /diagnostics

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code

  2. patterns - Regex pattern match (first matching pattern wins)

  3. default_topics - Fallback for all faults

Launch with config file:

ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p snapshots.enabled:=true \
  -p snapshots.config_file:=/path/to/snapshots.yaml \
  -p snapshots.default_topics:="['/diagnostics']"

Querying Snapshots

Snapshots are included inline in the fault response as environment_data:

Get fault details with snapshots:

curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT

Response:

{
  "item": {
    "code": "MOTOR_OVERHEAT",
    "fault_name": "Motor temperature exceeded threshold",
    "severity": 2,
    "status": {
      "aggregatedStatus": "active",
      "testFailed": "1",
      "confirmedDTC": "1"
    }
  },
  "environment_data": {
    "extended_data_records": {
      "first_occurrence": "2026-02-04T10:30:00.000Z",
      "last_occurrence": "2026-02-04T10:35:00.000Z"
    },
    "snapshots": [
      {
        "type": "freeze_frame",
        "name": "motor_temperature",
        "data": 85.5,
        "x-medkit": {
          "topic": "/motor/temperature",
          "message_type": "sensor_msgs/msg/Temperature",
          "full_data": {"temperature": 85.5, "variance": 0.1},
          "captured_at": "2026-02-04T10:30:00.123Z"
        }
      },
      {
        "type": "rosbag",
        "name": "fault_recording",
        "bulk_data_uri": "/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000",
        "size_bytes": 1234567,
        "duration_sec": 6.0,
        "format": "mcap"
      }
    ]
  },
  "x-medkit": {
    "occurrence_count": 3,
    "reporting_sources": ["/powertrain/motor_controller"]
  }
}

Snapshot Types:

  • freeze_frame: Topic data captured at fault confirmation (JSON format)

  • rosbag: Recording file available via bulk-data endpoint (binary format)

Get snapshots from fault response using jq:

curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT | \
  jq '.environment_data.snapshots'

Example Workflow

This example demonstrates the complete snapshot capture workflow.

1. Configure and start the fault manager:

ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p snapshots.enabled:=true \
  -p snapshots.default_topics:="['/odom']"

2. Start a node that publishes odometry:

ros2 topic pub /odom nav_msgs/msg/Odometry \
  "{pose: {pose: {position: {x: 1.5, y: 2.0}}}}" -r 10

3. Report a fault (it will be confirmed immediately by default):

ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'NAV_ERROR', event_type: 0, severity: 2, \
    description: 'Navigation failed', source_id: '/nav_node'}"

4. Query the captured snapshot:

curl http://localhost:8080/api/v1/apps/nav_node/faults/NAV_ERROR | \
  jq '.environment_data.snapshots'

The response will contain the odometry data that was captured at the moment the fault was confirmed.

Troubleshooting

No snapshots captured

  • Verify snapshots.enabled is true

  • Check that configured topics exist and are publishing

  • Increase snapshots.timeout_sec for slow-publishing topics

  • Check fault manager logs for capture errors

Empty topics object in response

  • The fault may have been cleared (snapshots are deleted on clear)

  • No topics were configured for this fault code

  • All configured topics timed out or exceeded size limit

Snapshot data truncated

  • Message exceeded snapshots.max_message_size

  • Increase the limit or filter to smaller topics

Wrong topics captured

  • Check topic resolution priority (fault_specific > patterns > default)

  • Verify regex patterns in config file are correct

Rosbag Capture (Time-Window Recording)

In addition to JSON snapshots, you can enable rosbag capture for “black box” style recording. This continuously buffers messages in memory and flushes them to a bag file when a fault is confirmed.

Key differences from JSON snapshots:

Feature

JSON Snapshots

Rosbag Capture

Data format

JSON (human-readable)

Binary (native ROS 2)

Time coverage

Point-in-time (at confirmation)

Time window (before + after fault)

Message fidelity

Converted to JSON

Original serialization preserved

Playback

N/A

ros2 bag play

Default

Enabled

Disabled

Enabling Rosbag Capture

ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p snapshots.rosbag.enabled:=true \
  -p snapshots.rosbag.duration_sec:=5.0 \
  -p snapshots.rosbag.duration_after_sec:=1.0

This captures 5 seconds of data before the fault and 1 second after.

Rosbag Configuration Options

Parameter

Default

Description

snapshots.rosbag.enabled

false

Enable rosbag capture. When enabled, the system continuously buffers messages in memory and writes them to a bag file when faults are confirmed.

snapshots.rosbag.duration_sec

5.0

Ring buffer duration in seconds. This determines how much history is preserved before the fault confirmation. Larger values provide more context but consume more memory.

snapshots.rosbag.duration_after_sec

1.0

Post-fault recording duration. After a fault is confirmed, recording continues for this many seconds to capture immediate system response.

snapshots.rosbag.topics

"config"

Topic selection mode:

  • "config" - Use same topics as JSON snapshots (from config file)

  • "all" or "auto" - Auto-discover and record all available topics

  • "explicit" - Use only topics from include_topics list

snapshots.rosbag.include_topics

[]

Explicit list of topics to record (only used when topics: "explicit"). Example: ["/odom", "/joint_states", "/cmd_vel"]

snapshots.rosbag.exclude_topics

[]

Topics to exclude from recording (applies to all modes). Useful for filtering high-bandwidth topics like camera images.

snapshots.rosbag.format

"sqlite3"

Bag storage format: "sqlite3" (default, widely compatible) or "mcap" (more efficient compression, requires plugin).

snapshots.rosbag.storage_path

""

Directory for bag files. Empty string uses system temp directory (/tmp). Bags are named fault_{code}_{timestamp}/.

snapshots.rosbag.auto_cleanup

true

Automatically delete bag files when faults are cleared. Set to false to retain bags for manual analysis.

snapshots.rosbag.lazy_start

false

Controls when the ring buffer starts recording. See diagram below.

snapshots.rosbag.max_bag_size_mb

50

Maximum size per bag file in MB. When exceeded, rosbag2 creates additional segment files.

snapshots.rosbag.max_total_storage_mb

500

Total storage limit for all bag files. Oldest bags are automatically deleted when this limit is exceeded.

Understanding lazy_start Mode

The lazy_start parameter controls when the ring buffer starts recording:

lazy_start: false (default) - Recording starts immediately at node startup. Best for development and when you need maximum context for any fault.

@startuml
skinparam backgroundColor transparent
participant "Ring Buffer" as RB
participant "Fault Manager" as FM

== Node Startup ==
RB -> RB : Start recording
note right of RB #LightGreen : Buffer active\n(continuous)

... time passes (messages buffered) ...

== Fault Detected ==
FM -> FM : PREFAILED
note right of RB #LightGreen : duration_sec\nof data buffered

FM -> FM : CONFIRMED
FM -> RB : Flush buffer
RB -> RB : Write pre-fault data
note right of RB #LightBlue : Post-fault\nrecording

... duration_after_sec ...

RB -> RB : Save bag file
note right of RB #LightGreen : Resume\nbuffering
@enduml

lazy_start: true - Recording only starts when a fault enters PREFAILED state. Saves resources but may miss context if fault confirms before buffer fills.

@startuml
skinparam backgroundColor transparent
participant "Ring Buffer" as RB
participant "Fault Manager" as FM

== Node Startup ==
note right of RB #LightGray : Buffer inactive\n(saving resources)

... time passes (no recording) ...

== Fault Detected ==
FM -> FM : PREFAILED
FM -> RB : Start buffer
note right of RB #LightGreen : Recording\nstarts now

... buffer filling (may be < duration_sec) ...

FM -> FM : CONFIRMED
FM -> RB : Flush buffer
RB -> RB : Write pre-fault data
note right of RB #LightBlue : Post-fault\nrecording

... duration_after_sec ...

RB -> RB : Save bag file
note right of RB #LightGray : Buffer\ninactive
@enduml

When to use lazy_start: true:

  • Production systems with limited resources

  • When faults have reliable PREFAILED → CONFIRMED progression

  • Systems where most faults are debounced (enter PREFAILED first)

When to use lazy_start: false:

  • Development and debugging

  • When faults may skip PREFAILED state (severity 3 = CRITICAL)

  • When maximum fault context is more important than resource usage

Note

The "mcap" format requires rosbag2_storage_mcap to be installed. If not available, use "sqlite3" (default).

# Install MCAP support (optional)
sudo apt install ros-${ROS_DISTRO}-rosbag2-storage-mcap

Downloading Rosbag Files

Rosbag files are downloaded via SOVD bulk-data endpoints.

1. List available rosbags for an entity:

curl http://localhost:8080/api/v1/apps/motor_controller/bulk-data/rosbags

Response:

{
  "items": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "MOTOR_OVERHEAT recording",
      "mimetype": "application/x-mcap",
      "size": 1234567,
      "creation_date": "2026-02-04T10:30:00.000Z",
      "x-medkit": {
        "fault_code": "MOTOR_OVERHEAT",
        "duration_sec": 6.0,
        "format": "mcap"
      }
    }
  ]
}

2. Download a specific rosbag:

Use the bulk_data_uri from the fault response, or construct from listing:

# Using bulk_data_uri from fault response
curl -O -J http://localhost:8080/api/v1/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000

The -J flag uses the server-provided filename from Content-Disposition header.

3. Play back the rosbag:

ros2 bag play MOTOR_OVERHEAT.mcap

Via ROS 2 service (alternative):

ros2 service call /fault_manager/get_rosbag ros2_medkit_msgs/srv/GetRosbag \
  "{fault_code: 'MOTOR_OVERHEAT'}"

Example: Production Configuration

For production use with conservative resource usage:

# config/snapshots.yaml
rosbag:
  enabled: true
  duration_sec: 3.0
  duration_after_sec: 0.5
  topics: "config"           # Use same topics as JSON snapshots
  lazy_start: true           # Save resources until fault detected
  format: "sqlite3"
  max_bag_size_mb: 25
  max_total_storage_mb: 200
  auto_cleanup: true

# Exclude high-bandwidth topics
# exclude_topics:
#   - /camera/image_raw
#   - /pointcloud

Example: Debugging Configuration

For development with maximum context:

rosbag:
  enabled: true
  duration_sec: 10.0         # 10 seconds before fault
  duration_after_sec: 2.0    # 2 seconds after
  topics: "config"
  lazy_start: false          # Always recording
  format: "sqlite3"
  storage_path: "/var/log/ros2_medkit/rosbags"
  max_bag_size_mb: 100
  max_total_storage_mb: 1000
  auto_cleanup: false        # Keep bags for analysis

See Also

Migration from Legacy Endpoints

If you were using the legacy snapshot endpoints, migrate to the new SOVD-compliant API:

Snapshots:

Previous (removed)

Current

GET /faults/{code}/snapshots

GET /apps/{app}/faults/{code}environment_data.snapshots[]

GET /components/{id}/faults/{code}/snapshots

GET /components/{id}/faults/{code}environment_data.snapshots[]

Rosbag Downloads:

Previous (removed)

Current

GET /faults/{code}/snapshots/bag

GET /apps/{app}/bulk-data/rosbags/{id}

GET /components/{id}/faults/{code}/snapshots/bag

GET /components/{id}/bulk-data/rosbags/{id}

Key Changes:

  1. Snapshots inline: No separate snapshot endpoint; data is in fault response

  2. Bulk-data pattern: Rosbags use SOVD bulk-data with UUID identifiers

  3. Entity-scoped: Bulk-data endpoints require entity path (e.g., /apps/motor)

  4. SOVD status: Fault response includes SOVD-compliant status object