Configuring Snapshot Capture ============================ This tutorial shows how to configure snapshot capture to automatically preserve topic data when faults are confirmed, enabling post-mortem debugging. .. contents:: Table of Contents :local: :depth: 2 Overview -------- When a fault transitions to CONFIRMED status, the system can automatically capture data from configured ROS 2 topics. This snapshot preserves the system state at the moment of fault occurrence, similar to: - **AUTOSAR DEM freeze frames** - diagnostic data captured at fault detection - **SOVD environment data** - system context for fault analysis Snapshots are useful for: - Debugging intermittent faults that are hard to reproduce - Understanding system state when a fault occurred - Post-mortem analysis without real-time access to the robot .. note:: Snapshots are automatically deleted when a fault is cleared via the ``DELETE /api/v1/faults/{code}`` endpoint or ``~/clear_fault`` service. Quick Start ----------- 1. **Start the fault manager with snapshot capture enabled:** .. code-block:: bash ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \ -p snapshots.enabled:=true \ -p snapshots.default_topics:="['/odom', '/battery_state']" 2. **Start the gateway:** .. code-block:: bash ros2 launch ros2_medkit_gateway gateway.launch.py 3. **When a fault is confirmed, query its snapshots:** .. code-block:: bash curl http://localhost:8080/api/v1/faults/MOTOR_OVERHEAT/snapshots Configuration Options --------------------- Configure snapshot capture via fault manager parameters: .. list-table:: :widths: 30 15 55 :header-rows: 1 * - Parameter - Default - Description * - ``snapshots.enabled`` - ``true`` - Enable/disable snapshot capture * - ``snapshots.default_topics`` - ``[]`` - Topics to capture for all faults * - ``snapshots.config_file`` - ``""`` - Path to YAML config for fault-specific topics * - ``snapshots.timeout_sec`` - ``1.0`` - Timeout waiting for topic message * - ``snapshots.max_message_size`` - ``65536`` - Maximum message size in bytes (larger messages skipped) * - ``snapshots.background_capture`` - ``false`` - Use background subscriptions (caches latest message) Advanced Configuration ---------------------- For fault-specific topic capture, create a YAML configuration file: .. code-block:: yaml # snapshots.yaml fault_specific: MOTOR_OVERHEAT: - /joint_states - /motor/temperature BATTERY_LOW: - /battery_state - /power_management/status patterns: "MOTOR_.*": - /joint_states - /cmd_vel "SENSOR_.*": - /diagnostics **Topic Resolution Priority:** 1. ``fault_specific`` - Exact match for fault code 2. ``patterns`` - Regex pattern match (first matching pattern wins) 3. ``default_topics`` - Fallback for all faults **Launch with config file:** .. code-block:: bash ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \ -p snapshots.enabled:=true \ -p snapshots.config_file:=/path/to/snapshots.yaml \ -p snapshots.default_topics:="['/diagnostics']" Querying Snapshots ------------------ Snapshots are included inline in the fault response as ``environment_data``: **Get fault details with snapshots:** .. code-block:: bash curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT **Response:** .. code-block:: json { "item": { "code": "MOTOR_OVERHEAT", "fault_name": "Motor temperature exceeded threshold", "severity": 2, "status": { "aggregatedStatus": "active", "testFailed": "1", "confirmedDTC": "1" } }, "environment_data": { "extended_data_records": { "first_occurrence": "2026-02-04T10:30:00.000Z", "last_occurrence": "2026-02-04T10:35:00.000Z" }, "snapshots": [ { "type": "freeze_frame", "name": "motor_temperature", "data": 85.5, "x-medkit": { "topic": "/motor/temperature", "message_type": "sensor_msgs/msg/Temperature", "full_data": {"temperature": 85.5, "variance": 0.1}, "captured_at": "2026-02-04T10:30:00.123Z" } }, { "type": "rosbag", "name": "fault_recording", "bulk_data_uri": "/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000", "size_bytes": 1234567, "duration_sec": 6.0, "format": "mcap" } ] }, "x-medkit": { "occurrence_count": 3, "reporting_sources": ["/powertrain/motor_controller"] } } **Snapshot Types:** - ``freeze_frame``: Topic data captured at fault confirmation (JSON format) - ``rosbag``: Recording file available via bulk-data endpoint (binary format) **Get snapshots from fault response using jq:** .. code-block:: bash curl http://localhost:8080/api/v1/apps/motor_controller/faults/MOTOR_OVERHEAT | \ jq '.environment_data.snapshots' Example Workflow ---------------- This example demonstrates the complete snapshot capture workflow. **1. Configure and start the fault manager:** .. code-block:: bash ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \ -p snapshots.enabled:=true \ -p snapshots.default_topics:="['/odom']" **2. Start a node that publishes odometry:** .. code-block:: bash ros2 topic pub /odom nav_msgs/msg/Odometry \ "{pose: {pose: {position: {x: 1.5, y: 2.0}}}}" -r 10 **3. Report a fault (it will be confirmed immediately by default):** .. code-block:: bash ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \ "{fault_code: 'NAV_ERROR', event_type: 0, severity: 2, \ description: 'Navigation failed', source_id: '/nav_node'}" **4. Query the captured snapshot:** .. code-block:: bash curl http://localhost:8080/api/v1/apps/nav_node/faults/NAV_ERROR | \ jq '.environment_data.snapshots' The response will contain the odometry data that was captured at the moment the fault was confirmed. Troubleshooting --------------- **No snapshots captured** - Verify ``snapshots.enabled`` is ``true`` - Check that configured topics exist and are publishing - Increase ``snapshots.timeout_sec`` for slow-publishing topics - Check fault manager logs for capture errors **Empty topics object in response** - The fault may have been cleared (snapshots are deleted on clear) - No topics were configured for this fault code - All configured topics timed out or exceeded size limit **Snapshot data truncated** - Message exceeded ``snapshots.max_message_size`` - Increase the limit or filter to smaller topics **Wrong topics captured** - Check topic resolution priority (fault_specific > patterns > default) - Verify regex patterns in config file are correct Rosbag Capture (Time-Window Recording) -------------------------------------- In addition to JSON snapshots, you can enable **rosbag capture** for "black box" style recording. This continuously buffers messages in memory and flushes them to a bag file when a fault is confirmed. **Key differences from JSON snapshots:** .. list-table:: :widths: 25 35 40 :header-rows: 1 * - Feature - JSON Snapshots - Rosbag Capture * - Data format - JSON (human-readable) - Binary (native ROS 2) * - Time coverage - Point-in-time (at confirmation) - Time window (before + after fault) * - Message fidelity - Converted to JSON - Original serialization preserved * - Playback - N/A - ``ros2 bag play`` * - Default - Enabled - Disabled Enabling Rosbag Capture ^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \ -p snapshots.rosbag.enabled:=true \ -p snapshots.rosbag.duration_sec:=5.0 \ -p snapshots.rosbag.duration_after_sec:=1.0 This captures 5 seconds of data **before** the fault and 1 second **after**. Rosbag Configuration Options ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 35 15 50 :header-rows: 1 * - Parameter - Default - Description * - ``snapshots.rosbag.enabled`` - ``false`` - Enable rosbag capture. When enabled, the system continuously buffers messages in memory and writes them to a bag file when faults are confirmed. * - ``snapshots.rosbag.duration_sec`` - ``5.0`` - Ring buffer duration in seconds. This determines how much history is preserved before the fault confirmation. Larger values provide more context but consume more memory. * - ``snapshots.rosbag.duration_after_sec`` - ``1.0`` - Post-fault recording duration. After a fault is confirmed, recording continues for this many seconds to capture immediate system response. * - ``snapshots.rosbag.topics`` - ``"config"`` - Topic selection mode: - ``"config"`` - Use same topics as JSON snapshots (from config file) - ``"all"`` or ``"auto"`` - Auto-discover and record all available topics - ``"explicit"`` - Use only topics from ``include_topics`` list * - ``snapshots.rosbag.include_topics`` - ``[]`` - Explicit list of topics to record (only used when ``topics: "explicit"``). Example: ``["/odom", "/joint_states", "/cmd_vel"]`` * - ``snapshots.rosbag.exclude_topics`` - ``[]`` - Topics to exclude from recording (applies to all modes). Useful for filtering high-bandwidth topics like camera images. * - ``snapshots.rosbag.format`` - ``"sqlite3"`` - Bag storage format: ``"sqlite3"`` (default, widely compatible) or ``"mcap"`` (more efficient compression, requires plugin). * - ``snapshots.rosbag.storage_path`` - ``""`` - Directory for bag files. Empty string uses system temp directory (``/tmp``). Bags are named ``fault_{code}_{timestamp}/``. * - ``snapshots.rosbag.auto_cleanup`` - ``true`` - Automatically delete bag files when faults are cleared. Set to ``false`` to retain bags for manual analysis. * - ``snapshots.rosbag.lazy_start`` - ``false`` - Controls when the ring buffer starts recording. See diagram below. * - ``snapshots.rosbag.max_bag_size_mb`` - ``50`` - Maximum size per bag file in MB. When exceeded, rosbag2 creates additional segment files. * - ``snapshots.rosbag.max_total_storage_mb`` - ``500`` - Total storage limit for all bag files. Oldest bags are automatically deleted when this limit is exceeded. Understanding lazy_start Mode ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``lazy_start`` parameter controls when the ring buffer starts recording: **lazy_start: false (default)** - Recording starts immediately at node startup. Best for development and when you need maximum context for any fault. .. uml:: @startuml skinparam backgroundColor transparent participant "Ring Buffer" as RB participant "Fault Manager" as FM == Node Startup == RB -> RB : Start recording note right of RB #LightGreen : Buffer active\n(continuous) ... time passes (messages buffered) ... == Fault Detected == FM -> FM : PREFAILED note right of RB #LightGreen : duration_sec\nof data buffered FM -> FM : CONFIRMED FM -> RB : Flush buffer RB -> RB : Write pre-fault data note right of RB #LightBlue : Post-fault\nrecording ... duration_after_sec ... RB -> RB : Save bag file note right of RB #LightGreen : Resume\nbuffering @enduml **lazy_start: true** - Recording only starts when a fault enters PREFAILED state. Saves resources but may miss context if fault confirms before buffer fills. .. uml:: @startuml skinparam backgroundColor transparent participant "Ring Buffer" as RB participant "Fault Manager" as FM == Node Startup == note right of RB #LightGray : Buffer inactive\n(saving resources) ... time passes (no recording) ... == Fault Detected == FM -> FM : PREFAILED FM -> RB : Start buffer note right of RB #LightGreen : Recording\nstarts now ... buffer filling (may be < duration_sec) ... FM -> FM : CONFIRMED FM -> RB : Flush buffer RB -> RB : Write pre-fault data note right of RB #LightBlue : Post-fault\nrecording ... duration_after_sec ... RB -> RB : Save bag file note right of RB #LightGray : Buffer\ninactive @enduml **When to use lazy_start: true:** - Production systems with limited resources - When faults have reliable PREFAILED → CONFIRMED progression - Systems where most faults are debounced (enter PREFAILED first) **When to use lazy_start: false:** - Development and debugging - When faults may skip PREFAILED state (severity 3 = CRITICAL) - When maximum fault context is more important than resource usage .. note:: The ``"mcap"`` format requires ``rosbag2_storage_mcap`` to be installed. If not available, use ``"sqlite3"`` (default). .. code-block:: bash # Install MCAP support (optional) sudo apt install ros-${ROS_DISTRO}-rosbag2-storage-mcap Downloading Rosbag Files ^^^^^^^^^^^^^^^^^^^^^^^^ Rosbag files are downloaded via SOVD bulk-data endpoints. **1. List available rosbags for an entity:** .. code-block:: bash curl http://localhost:8080/api/v1/apps/motor_controller/bulk-data/rosbags **Response:** .. code-block:: json { "items": [ { "id": "550e8400-e29b-41d4-a716-446655440000", "name": "MOTOR_OVERHEAT recording", "mimetype": "application/x-mcap", "size": 1234567, "creation_date": "2026-02-04T10:30:00.000Z", "x-medkit": { "fault_code": "MOTOR_OVERHEAT", "duration_sec": 6.0, "format": "mcap" } } ] } **2. Download a specific rosbag:** Use the ``bulk_data_uri`` from the fault response, or construct from listing: .. code-block:: bash # Using bulk_data_uri from fault response curl -O -J http://localhost:8080/api/v1/apps/motor_controller/bulk-data/rosbags/550e8400-e29b-41d4-a716-446655440000 The ``-J`` flag uses the server-provided filename from ``Content-Disposition`` header. **3. Play back the rosbag:** .. code-block:: bash ros2 bag play MOTOR_OVERHEAT.mcap **Via ROS 2 service (alternative):** .. code-block:: bash ros2 service call /fault_manager/get_rosbag ros2_medkit_msgs/srv/GetRosbag \ "{fault_code: 'MOTOR_OVERHEAT'}" Example: Production Configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For production use with conservative resource usage: .. code-block:: yaml # config/snapshots.yaml rosbag: enabled: true duration_sec: 3.0 duration_after_sec: 0.5 topics: "config" # Use same topics as JSON snapshots lazy_start: true # Save resources until fault detected format: "sqlite3" max_bag_size_mb: 25 max_total_storage_mb: 200 auto_cleanup: true # Exclude high-bandwidth topics # exclude_topics: # - /camera/image_raw # - /pointcloud Example: Debugging Configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For development with maximum context: .. code-block:: yaml rosbag: enabled: true duration_sec: 10.0 # 10 seconds before fault duration_after_sec: 2.0 # 2 seconds after topics: "config" lazy_start: false # Always recording format: "sqlite3" storage_path: "/var/log/ros2_medkit/rosbags" max_bag_size_mb: 100 max_total_storage_mb: 1000 auto_cleanup: false # Keep bags for analysis See Also -------- - :doc:`../api/rest` - REST API reference (Bulk Data section) - :doc:`../requirements/specs/faults` - Fault API requirements - `Gateway README `_ - REST API reference - `config/snapshots.yaml `_ - Full configuration reference Migration from Legacy Endpoints ------------------------------- If you were using the legacy snapshot endpoints, migrate to the new SOVD-compliant API: **Snapshots:** .. list-table:: :widths: 45 55 :header-rows: 1 * - Previous (removed) - Current * - ``GET /faults/{code}/snapshots`` - ``GET /apps/{app}/faults/{code}`` → ``environment_data.snapshots[]`` * - ``GET /components/{id}/faults/{code}/snapshots`` - ``GET /components/{id}/faults/{code}`` → ``environment_data.snapshots[]`` **Rosbag Downloads:** .. list-table:: :widths: 45 55 :header-rows: 1 * - Previous (removed) - Current * - ``GET /faults/{code}/snapshots/bag`` - ``GET /apps/{app}/bulk-data/rosbags/{id}`` * - ``GET /components/{id}/faults/{code}/snapshots/bag`` - ``GET /components/{id}/bulk-data/rosbags/{id}`` **Key Changes:** 1. **Snapshots inline**: No separate snapshot endpoint; data is in fault response 2. **Bulk-data pattern**: Rosbags use SOVD bulk-data with UUID identifiers 3. **Entity-scoped**: Bulk-data endpoints require entity path (e.g., ``/apps/motor``) 4. **SOVD status**: Fault response includes SOVD-compliant ``status`` object