Entity Cache Architecture ========================= Background ---------- The gateway's ``ThreadSafeEntityCache`` holds the live SOVD entity tree (Areas, Components, Apps, Functions) and is read by every HTTP handler thread. Prior to issue #442 the cache stored entities in ``std::unordered_map`` containers that reallocated on insert and shifted slot addresses on erase. Discovery refreshes - which run every 100 ms on graph-event detection - caused structural allocations even when nothing had changed, adding heap churn and making the cache unsuitable for memory-constrained or embedded deployments. This document describes the replacement architecture introduced in #442. Goals ----- 1. Eliminate per-refresh structural allocations in the cache layer for a stable entity graph. 2. Make heap use predictable: capacity is fixed at startup, not open-ended. 3. Gate the generation counter on actual changes so polling it on a quiet graph is cheap. 4. Preserve the existing public API and SOVD REST contract - the changes are entirely below the handler layer. Object Pool and Flat Index Architecture --------------------------------------- The cache uses two building blocks. **SlotStore** - a stable-slot object pool. Each entity type (Area, Component, App, Function) gets its own ``SlotStore``. The store holds entities in a flat ``std::vector`` that is pre-reserved to ``capacity`` slots at construction. Slots are never moved or shifted after allocation; a freed slot is marked available and reused for the next insert. Slot IDs are stable for the lifetime of an entity. The store exposes typed references so callers do not hold raw pointers across mutations. **FlatHashMap** - a linear-probe open-address hash map pre-reserved to the same capacity at construction. Used for all index maps: - ``area_index_``, ``component_index_``, ``app_index_``, ``function_index_`` - map entity ID to slot ID (O(1) lookup) - ``component_to_apps_``, ``area_to_components_``, etc. - relationship indexes (parent ID to child slot ID list) - ``operation_index_`` - operation full_path to owning entity reference - ``topic_type_cache_`` - topic name to message type All these maps are reserved once at ``ThreadSafeEntityCache`` construction and do not rehash as long as the live entity count stays within ``capacity``. The ``capacity`` value comes from the ``entity_cache.capacity`` ROS parameter (default: 256, valid range: 16-1000000; values outside this range are clamped with a warning at startup). Incremental Reconcile --------------------- ``update_all`` (and the per-type variants ``update_areas``, ``update_components``, ``update_apps``, ``update_functions``) implement an in-place diff instead of a clear-and-rebuild: 1. **Remove** slots whose IDs are absent from the incoming list. 2. **Add** incoming IDs that are not yet in the index (allocate a new slot from the pool). 3. **Change** slots where an existing entity's payload differs from the incoming value (in-place overwrite, no slot movement). Only mutations that fall into categories 1, 2, or 3 advance the generation counter. A no-op call (incoming set is identical to the current cache) leaves the generation unchanged. This allows consumers such as the OpenAPI capability generator to skip expensive regeneration when the graph has not changed. Zero-Structural-Allocation Property ------------------------------------ Once ``capacity`` slots are reserved and the entity count stabilises below that threshold, steady-state ``update_all`` calls perform zero structural allocations in the cache layer: no ``std::vector`` resize, no hash map rehash, no ``new``/``delete`` for index entries. In-place slot overwrites update entity payloads without touching the pool structure. .. note:: **Honest boundary.** Entity payloads themselves - ``nlohmann::json`` objects (e.g., ``host_metadata``, ``type_info``) and ``std::vector`` fields (topic lists, service lists) - still allocate on the heap inside the discovery layer before the cache sees them. Making those payloads fixed-capacity (e.g., custom allocators or pre-sized arenas) is a separate future effort and is explicitly out of scope for this change. The zero-structural-allocation guarantee applies to the cache data structures, not to the payloads stored in them. Discovery-Side Allocation Under Graph Churn ------------------------------------------- The cache is only the last step of ``refresh_cache()``. Per-graph-event allocation profiling showed the dominant churn under a churning graph was in the discovery pipeline, not the cache rebuild, driven by two things: 1. **Eager schema building.** ``discover_apps`` rebuilt the JSON request/ response (and goal/result/feedback) schemas for every service and action in the graph on every refresh, then deep-copied them through several intermediate containers. These schemas are immutable per type and are not read from the discovered entities at all - the ``/operations`` handler already resolves them on demand. Discovery now leaves ``type_info`` empty; the handler resolves it lazily and ``TypeIntrospection`` caches the assembled per-type ``type_info`` as a shared, immutable object (``get_service_type_info`` / ``get_action_type_info``), so repeated requests reuse it. 2. **Refresh frequency.** The rclcpp graph event fires many times per second under churn, running the full discovery pipeline each time. Graph events are now debounced (``discovery.refresh_debounce_ms``, default 1 s) so a burst of graph changes coalesces into a single refresh. Overflow Behaviour ------------------ If ``update_all`` receives more entities than the reserved ``capacity``, the pool and maps grow dynamically (the standard resize path). A one-shot WARN log fires when this happens: .. code-block:: text [WARN] entity_cache capacity 256 exceeded (grew); raise entity_cache.capacity for embedded determinism The gateway continues operating normally. The WARN fires once per process lifetime to avoid log spam. Generation Counter ------------------ ``ThreadSafeEntityCache::generation()`` returns a ``uint64_t`` that increments on each mutation that actually changes the cache contents. Callers use it as a change token: .. code-block:: cpp auto gen = cache.generation(); // ... later ... if (cache.generation() != gen) { rebuild_openapi_spec(); } Because the counter is gated on real changes, polling it on a stable graph costs nothing beyond an atomic load. Health Endpoint Stats --------------------- ``GET /api/v1/health`` exposes entity cache statistics under the ``x-medkit-entity-cache`` vendor-extension key: .. code-block:: json { "status": "healthy", "x-medkit-entity-cache": { "capacity": 256, "areas": 2, "components": 1, "apps": 14, "functions": 3, "generation": 7, "grew": false } } Configuration ------------- The cache capacity is set via ``entity_cache.capacity`` in ``gateway_params.yaml`` (or any ROS parameter source). See :doc:`/config/server` (Performance Tuning section) for the full parameter reference. See Also -------- - :doc:`ros2_subscription_architecture` - Analogous embedded-hardened design for the topic subscription pool - :doc:`/config/server` - ``entity_cache.capacity`` parameter reference